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Abstract 

We consider the fundamental problem of solving quadratic systems of equations in n variables, where 
yi = |(aj, x)\ 2 , i = 1 , ... ,m and x £ R™ is unknown. We propose a novel method, which starting with an 
initial guess computed by means of a spectral method, proceeds by minimizing a nonconvex functional as 
in the Wirtinger flow approach if|. There are several key distinguishing features, most notably, a distinct 
objective functional and novel update rules, which operate in an adaptive fashion and drop terms bearing 
too much influence on the search direction. These careful selection rules provide a tighter initial guess, 
better descent directions, and thus enhanced practical performance. On the theoretical side, we prove 
that for certain unstructured models of quadratic systems, our algorithms return the correct solution in 
linear time, i.e. in time proportional to reading the data {(n} and {yi} as soon as the ratio m/n between 
the number of equations and unknowns exceeds a fixed numerical constant. We extend the theory to 
deal with noisy systems in which we only have yi « |(ai,a:}| 2 and prove that our algorithms achieve a 
statistical accuracy, which is nearly un-improvable. We complement our theoretical study with numerical 
examples showing that solving random quadratic systems is both computationally and statistically not 
much harder than solving linear systems of the same size—hence the title of this paper. For instance, we 
demonstrate empirically that the computational cost of our algorithm is about four times that of solving 
a least-squares problem of the same size. 


1 Introduction 

1.1 Problem formulation 

Imagine we are given a set of m quadratic equations taking the form 

yi = \(a i7 x)\ 2 , i = 1, - ■■ , to , ( 1 ) 

where the data y = [y»]i<i< m and design vectors £ IV/C" are known whereas x £ K n /C n is unknown. 
Having information about \{a i: x )\ 2 — or, equivalently, \(a,i,x)\ -means that we a priori know nothing about 
the phases or signs of the linear products {a^x). The problem is this: can we hope to identify a solution, if 
any, compatible with this nonlinear system of equations? 

This problem is combinatorial in nature as one can alternatively pose it as recovering the missing signs 
of (a,i,x) from magnitude-only observations. As is well known, many classical combinatorial problems with 
Boolean variables may be cast as special instances of 0 - As an example, consider the NP-hard stone 
problem [2] in which we have n stones each of weight Wi > 0 (1 < i < n), which we would like to divide into 
two groups of equal sum weight. Letting Xi £ {—1,1} indicate which of the two groups the ith stone belongs 
to, one can formulate this problem as solving the following quadratic system 

hi = 1, i = l,---,n, 

[ {w\Xi -I - b W n x n ) 2 = 0. 
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However simple this formulation may seem, even checking whether a solution to (J2]) exists or not is known 
to be NP-hard. 

Moving from combinatorial optimization to the physical sciences, one application of paramount impor¬ 
tance is the phase retrieval |3j|4j problem, which permeates through a wide spectrum of techniques including 
X-ray crystallography, diffraction imaging, microscopy, and even quantum mechanics. In a nutshell, the 
problem of phase retrieval arises due to the physical limitation of optical sensors, which are often only able 
to record the intensities of the diffracted waves scattered by an object under study. Notably, upon illu¬ 
minating an object x , the diffraction pattern is of the form of Ax ; however, it is only possible to obtain 
intensity measurements y = |Aa;| 2 leading to the quadratic system 00 In the Fraunhofer regime where 
data is collected in the far-field zone, A is given by the spatial Fourier transform. We refer to j5] for in-depth 
reviews of this subject. 

Continuing this motivating line of thought, in any real-world application recorded intensities are always 
corrupted by at least a small amount of noise so that observed data are only about \(a,i,x}\ 2 \ i.e. 

Vi ~ \ {ai,x }\ 2 , i = 1, - ■■ ,m. (3) 

Although we present results for arbitrary noise distributions—even for non-stochastic noise—we shall pay 
particular attention to the Poisson data model, which assumes 

Vi Poisson(| (a t ,x)\ 2 ), i = l,---,m. (4) 

The reason why this statistical model is of special interest is that it naturally describes the variation in the 
number of photons detected by an optical sensor in various imaging applications. 


1.2 Nonconvex optimization 

Under a stochastic noise model with independent samples, a first impulse for solving 0 is to seek the 
maximum likelihood estimate (MLE), namely, 

E ttl 

£(z' i y i ), (5) 

i =1 

where £ (z; yi) denotes the log-likelihood of a candidate solution 2 given the outcome ip. For instance, under 
the Poisson data model Q one can write 

= Vi l°g(|a*z| 2 ) - \a*z\ 2 (6) 


modulo some constant offset. Unfortunately, the log-likelihood is usually not concave, thus making the 
problem of computing the MLE NP-hard in general. 

To alleviate this computational intractability, several convex surrogates have been proposed that work 
particularly well when the design vectors {a;} are chosen at random [6 20 . The basic idea is to introduce 
a rank-one matrix X = xx* to linearize the quadratic constraints, and then relax the rank-one constraint. 
Suppose we have Poisson data, then this strategy converts the problem into a convenient convex program: 


minimize* YnLi (M» - Vi log Mi) + ATr(X) 
subject to pi = ajXa i: 1 < i < m, 

x y o. 


Note that the log-likelihood function is augmented by the trace functional Tr(-) whose role is to promote low- 
rank solutions. While such convex relaxation schemes enjoy intriguing performance guarantees in many as¬ 
pects (e.g. they achieve minimal sample complexity and near-optimal error bounds for certain noise models), 
the computational cost typically far exceeds the order of n 3 . This limits applicability to large-dimensional 
data. 

This paper follows another route: rather than lifting the problem into higher dimensions by introducing 
matrix variables, this paradigm maintains its iterates within the vector domain and optimize the nonconvex 

1 Here and below, for z S C n , \z\ (resp. \z\' 2 ) represents the vector of magnitudes (\z\ , • • • , z n |) ' (resp. squared magnitudes 

(kil 2 ,-.. ,M 2 ) t ). 
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objective directly (e.g. [l][3j |2l] - f28] ). One promising approach along this line is the recently proposed two- 
stage algorithm called Wirtinger Flow (WF) [l]. Simply put, WF starts by computing a suitable initial 
guess z0) using a spectral method, and then successively refines the estimate via an update rule that bears 
a strong resemblance to a gradient descent scheme, namely, 


(* + D = z (t) + — V ™ 
m i 


where z <i:> denotes the tth iterate of the algorithm, and y, t is the step size (or learning rate). Here, V£(z; yi) 
stands for the Wirtinger derivative w.r.t. z, which in the real-valued case reduces to the ordinary gradient. 
The main results of jT] demonstrate that WF is surprisingly accurate for independent Gaussian design. 
Specifically, when a, ~ 7V(0, 1) or oq ~ 7V(0, 1) + jJ\f (0, 1): 

1. WF achieves exact recovery from m = O (nlogn) quadratic equations when there is no noise0 

2. WF attains e-accuracy—in a relative sense—within 0(mn 2 3 log(l/e)) time (or flops); 

3. In the presence of Gaussian noise, WF is stable and converges to the MLE as shown in |29|. 

While these results formalize the advantages of WF, the computational complexity of WF is still much larger 
than the best one can hope for. Moreover, the statistical guarantee in terms of the sample complexity is 
weaker than that achievable by convex relaxations^] 


1.3 This paper: Truncated Wirtinger Flow 

This paper develops an efficient linear-time algorithm, which also enjoys near-optimal statistical guarantees. 
Following the spirit of WF, we propose a novel procedure called Truncated Wirtinger Flow (TWF) adopting 
a more adaptive gradient flow. Informally, TWF proceeds in two stages: 

1. Initialization: compute an initial guess z^ 0 ^ by means of a spectral method applied to a subset To of 
the observations {yi}; 

2. Loop: for 0 < t < T, 

z (t+i ) =z W + tL Y W(zW;j/i) (7) 

m . 

^£7t+i 

for some index subset 7t+i C {1, • • • , m) determined by z^K 
Three remarks are in order. 

• Firstly, we regularize both the initialization and the gradient flow in a data-dependent fashion by 
operating only upon some iteration-varying index subsets 7). This is a distinguishing feature of TWF 
in comparison to WF and other gradient descent variants. In words, 7t corresponds to those data {yi} 
whose resulting spectral or gradient components are in some sense not excessively large; see Sections 
[2] and [3] for details. As we shall see later, the main point is that this careful data trimming procedure 
gives us a tighter initial guess and more stable search directions. 

• Secondly, we recommend that the step size /it is either taken as some appropriate constant or determined 
by a backtracking line search. For instance, under appropriate conditions, we can take /it = 0.2 for all 

t. 

2 The standard notation f(n) = O (g(n)) or /(n) < g{n) (resp. /(n) = S7 (g(n)) or /(n) > g(n)) means that there exists a 

constant c > 0 such that |/(n)| < c\g(n) (resp. \ f(n)\ > c|g(n)|). J'(n) <?(n) means that there exist constants ci,C 2 > 0 such 

that ci|p(n)| < |/(n)| < c 2 |s(n)|. 

3 M. Soltanolkotabi recently informed us that the sample complexity of WF may be improved if one employs a better 
initialization procedure. 
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• Finally, the most expensive part of the gradient stage consists in computing Vl(z\yi), 1 < i < m, 
which can often be performed in an efficient manner. More concretely, under the real-valued, Poisson 
data model Q one has 


W(z;vi) = 2 


-a,a7 2 — a,-a7 2 


= 2 


Vi — \ a J z \ 2 


Thus, calculating {V£(z; yi )} essentially amounts to two matrix-vector products. Letting A := [ai, • • • , a m ] T 
as before, we have 


E V£(z^; yi ) = A T v, 
i£7t +1 



i € Tt+i, 

otherwise. 


Hence, Az gives v and A' v the desired regularized gradient. 
A detailed specification of the algorithm is deferred to Section [2j 


1.4 Numerical surprises 

To give the readers a sense of the practical power of TWF, we present here three illustrative numerical 
examples. Since it is impossible to recover the global sign — i.e. we cannot distinguish x from — x —we will 
evaluate our solutions to the quadratic equations through the distance measure put forth in 11] representing 
the Euclidean distance modulo a global sign: for complex-valued signals, 

dist (z,x) := min^.gjo^) \\e~ J,p z - ®||, (8) 

while it is simply min \\z ± x|| in the real-valued case. We shall use dist (a;, a:)/||a:|| throughout to denote the 
relative error of an estimate x. In the sequel, TWF proceeds by attempting to maximize the Poisson log- 
likelihood (|6|) . Standalone Matlab implementations of TWF are available at http : //statweb. Stanford. 
edu/~candes/publications.html (see |30] for straight WF implementations). 

The first numerical example concerns the following two problems under noiseless real-valued data: 

(a) find x £ R ra s.t. bi = ajx , 1 < i < m; 

(b) find x £ M" s.t. bi = \ajx\, 1 < i < m. 


Apparently, (a) involves solving a linear system of equations (or a linear least squares problem), while 
(b) is tantamount to solving a quadratic system. Arguably the most popular method for solving large- 
scale least squares problems is the conjugate gradient (CG) method J3l] applied to the normal equations. 
We are going to compare the computational efficiency between CG (for solving least squares) and TWF 
with a step size pt = 0.2 (for solving a quadratic system). Set m = 8 n and generate x ~ Af(0,I) and 
ai ~ Af (0 ,/), 1 < i < m, independently. This gives a matrix A T A with a low condition number equal to 
about (1 + v / 178) 2 /(1 — \J 1/8) 2 « 4.38 by the Marchenko-Pastur law. Therefore, this is an ideal setting for 
CG as it converges extremely rapidly 32 Theorem 38.5]. Fig. [l] shows the relative estimation error of each 


method as a function of the iteration count, where TWF is seeded through 10 power iterations. For ease of 
comparison, we illustrate the iteration counts in different scales so that 4 TWF iterations are equivalent to 
1 CG iteration. 

Recognizing that each iteration of CG and TWF involves two matrix vector products Az and A t d, for 
such a design we reach a suprising observation: 


Even when all phase information is missing, TWF is capable of solving a quadratic system of 
equations only about 4 times slower than solving a least squares problem of the same size! 

To illustrate the applicability of TWF on real images, we turn to testing our algorithm on a digital 
photograph of Stanford main quad containing 320 x 1280 pixels. We consider a type of measurements that 
falls under the category of coded diffraction patterns (CDP) |33 and set 

y {l) = | FD {l) x\ 2 , 1 < l < L. (9) 
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Figure 1: Relative errors of CG and TWF vs. iteration count. Here, n = 1000, m = 8 n, and TWF is seeded 
using just 10 power iterations. 


Here, F stands for a discrete Fourier transform (DFT) matrix, and D ^ is a diagonal matrix whose diagonal 
entries are independently and uniformly drawn from {1,-1 (phase delays). In phase retrieval, each 

D {]) represents a random mask placed after the object so as to modulate the illumination patterns. When L 
masks are employed, the total number of quadratic measurements is m = nL. In this example, L = 12 random 
coded patterns are generated to measure each color band (i.e. red, green, or blue) separately. The experiment 
is carried out on a MacBook Pro equipped with a 3 GHz Intel Core i7 and 16GB of memory. We run 50 
iterations of the truncated power method for initialization, and 50 regularized gradient iterations, which 
in total costs 43.9 seconds or 2400 FFTs for each color band. The relative error after regularized spectral 
initialization and after 50 TWF iterations are 0.4773 and 2.16 x 10 —5 , respectively, with the recovered images 
displayed in Fig. [2] In comparison, the spectral initialization using 50 untruncated power iterations returns 
an image of relative error 1.409, which is almost like a random guess and extremely far from the truth. 

While the above experiments concern noiseless data, the numerical surprise extends to the noisy realm. 
Suppose the data are drawn according to the Poisson noise model Q, with a; ~ Af (0,7) independently 
generated. Fig. [3] displays the empirical relative mean-square error (MSE) of TWF as a function of the 
signal-to-noise ratio (SNR), where the relative MSE for an estimate x and the SNR are defined a^ 

MSE := dlSt , ^ , and SNR := 3||ai|| 2 . (10) 

11*11 


Both SNR and MSE are displayed on a dB scale (i.e. the values of 10log 10 (SNR) and 10log 10 (rel. MSE) are 
plotted). To evaluate the accuracy of the TWF solutions, we consider the performance achieved by MLE 
applied to an ideal problem in which the true phases are revealed. In this ideal scenario, in addition to the 
data {ui] we are further given exact phase information {ipi = sign(a i r ®)}. Such precious information gives 
away the phase retrieval problem and makes the MLE efficiently solvable since the MLE problem with side 
information 

minimize*^ - Y?=i Vi lo g (l a 7 z ?) + ( a J z ) 2 

subject to (fit = sign(a) r z) 

can be cast as a convex program 


minimize^eRn 


-E™, 2 ^ lo S {<Pi a J z ) + ( a J z ) 2 - 

Z - *1 = 1 


Fig. [3] illustrates the empirical performance for this ideal problem. The plots demonstrate that even when all 
phases are erased, TWF yields a solution of nearly the best possible quality, since it only incurs an extra 1.5 


4 To justify the definition of SNR, note that the signals and noise are captured by //, = (aJ x) 2 and y, — //,, 1 < i < m, 


respectively. The ratio of the signal power to the noise power is therefore 


e-j _ Sfai 1° 




Var fei] Ejli K T “I 2 


3m||aj || 
m||aj||' 


= 3||a:|| 
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Figure 2: The recovered images after (a) spectral initialization; (b) regularized spectral initialization; and 
(c) 50 TWF gradient iterations following the regularized initiliazation. 


dB loss compared to ideal MLE computed with all true phases revealed. This phenomenon arises regardless 
of the SNR! 

1.5 Main results 

The preceding numerical discoveries unveil promising features of TWF in three aspects: (1) exponentially 
fast convergence; (2) exact recovery from noiseless data with sample complexity O(ro); (3) nearly minimal 
mean-square loss in the presence of noise. This paper offers a formal substantiation of all these findings. To 
this end, we assume a tractable model in which the design vectors Gq’s are independent Gaussian: 

ai ~Ar(0,I n ). (11) 

For concreteness, our results are concerned with TWF designed based on the Poisson log-likelihood function 

li(z) := l (z; yi ) = Vi log {\ajz\ 2 ) - \ajz | 2 , (12) 

where we shall use ii{z) as a shorthand for £(z; yi) from now on. We begin with the performance guarantees 
of TWF in the absence of noise. 
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Figure 3: Relative MSE vs. SNR in dB. The curves are shown for two settings: TWF for solving quadratic 
equations (blue), and MLE had we observed additional phase information (green). The results are shown 
for n = 100, and each point is averaged over 50 Monte Carlo trials. 

Theorem 1 (Exact recovery). Consider the noiseless case 0 with an arbitrary signal x £ R n . Suppose 
that the step size pt is either taken to be a positive constant p t = p or chosen via a backtracking line search. 
Then there exist some universal constants 0 < p, v < 1 and po, cq, c±, C2 > 0 such that with probability 
exceeding 1 — C\ exp (—c^m), the truncated Wirtinger Flow estimates (Algorithm^ with parameters specified 
in Table [7p obey 

dist(z^,s) < z^(l — / o) t ||£c||, Vf £ N, (13) 

provided that 

m > con and 0 < p < po- 
As explained below, we can often take po ~ 0.3. 

Remark 1. As will be made precise in Section [5] (and in particular Proposition |T|) , one can take 

= 0-994 - Ci - C 2 - 
/i0 2(1.02 + 0.665/^) 

for some small quantities Ci>C 2 and some predetermined threshold ah that is usually taken to be ah > 5. 
Under appropriate conditions, one can treat po as po ~ 0.3. 

Theorem [l] justifies at least two appealing features of TWF: (i) minimal sample complexity and (ii) 
linear-time computational cost. Specifically, TWF allows exact recovery from 0(n ) quadratic equations, 
which is optimal since one needs at least n measurements to have a well-posed problem. Also, because of the 
geometric convergence rate, TWF achieves e-accuracy (i.e. dist(z^), x) < e ||ae||) within at most O (log(l/e)) 
iterations. The total computational cost is therefore O (mn log(l/e)), which is linear in the problem size. 
These outperform the performance guarantees of WF [l], which runs in 0(mn 2 log(l/e)) time and requires 
0(71 log n) sample complexity. 

We emphasize that enhanced performance vis-a-vis WF is not the result of a sharper analysis, but 
rather, the result of key algorithmic changes. In both the initialization and iterative refinement stages, TWF 
proceeds in a more prudent manner by means of proper regularization, which effectively trims away those 
components that are too influential on either the initial guess or search directions, thus reducing the volatility 
of each movement. With a tighter initialization and better-controlled search directions in place, we take the 
step size in a far more liberal fashion—which is some constant bounded away from 0—compared to a step 
size which is 0(1/n) as explained in [l]. In fact, what enables the movement to be more aggressive is exactly 
the cautious choice of 7t, which precludes adverse effects from high-leverage samples. 

To be broadly applicable, the proposed algorithm must guarantee reasonably faithful estimates in the 
presence of noise. Suppose that 

Vi = \(ai,x}\ 2 + rji, 1 < i < m, (14) 
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where rji represents an error term. We claim that TWF is stable against additive noise, as demonstrated in 
the theorem below. 


Theorem 2 (Stability). Consider the noisy case m Suppose that the step size pt is either taken to be 
a positive constant p t = p or chosen via a backtracking line search. If 


m > Con , p < po, and 


\m 


< Cl ||£C| 


(15) 


then with probability at least 1 — C 2 exp (— c^m), the truncated Wirtinger Flow estimates (Algorithm^ with 
parameters specified in Ta&fe[7p satisfy 


dist(M*\ x) 


< 


11*711 


sjrh\\x\ 


+ (! - p) t ||*||, VteN 


simultanesouly for all x £ M n . Here, 0 < p < 1 and p 0 , cq, ci, C 2 , C 3 > 0 are some universal constants. 
Under the Poisson noise model one has 

dist (z^\x) < 1 + (1 — p) t ||:r||, Vf £ N 

with probability approaching one, provided that ||x|| > log 1 " 5 ™.. 


(16) 


(17) 


Remark 2. In the main text, we will prove Theorem [2] only for the case where x is fixed and independent 
of the design vectors {a,}. Interested readers are referred to the supplemental materials [34 for the proof 
of the universal theory (i.e. the case simultaneously accommodating all x £ R n ). Note that when there is 
no noise (77 = 0), this stronger result guarantees the universality of the noiseless recovery. 

Remark 3. [29] establishes stability estimates using the WF approach under Gaussian noise. There, the 

sample and computational complexities are still on the order of nlogn and mn 2 respectively whereas the 
computational complexity in Theorem [2] is linear, i.e. on the order of mn. 

Theorem 2 essentially reveals that the estimation error of TWF rapidly shrinks to O ( ) within 


logarithmic iterations. Put another way, since the SNR for the model (|14| is captured by 

|4 


SNR := 


£r=ii<< 


3m||®|| 


\\m n*7ir 

we immediately arrive at an alternative form of the performance guarantee: 

1 


dist(;z (t \ x) 


< 


a/SNR 


Ml+ (!-p) ll*ll> Vie N, 


(18) 


(19) 


revealing the stability of TWF as a function of SNR. We emphasize that this estimate holds for any error term 
rj —i.e. any noise structure, even deterministic. This being said, specializing this estimate to the Poisson 
noise model Q with ||*|| > log 1 ' 5 ™ gives an estimation error that will eventually approach a numerical 
constant, independent of n and to. 

Encouragingly, this is already the best statistical guarantee any algorithm can achieve. We formalize this 
claim by deriving a fundamental lower bound on the minimax estimation error. 

Theorem 3 (Lower bound on the minimax risk). Suppose that ~ J\f(0,I), to = nn for some fixed 
k independent of n, and n is sufficiently large. For any K > log 1 ' 5 ™, defin^\ 

T(K) := {x G | 11*11 G (1 ± 0.1)A'}. 

With probability approaching one, the minimax risk under the Poisson model w obeys 

inf sup E[dist (x,x) I {a ji<i< m ] > -^=, (20) 

x xd'T(K) V K 


where the infimum is over all estimator x. Here, £1 > 0 is a numerical constant independent of n and m. 


5 Here, 0.1 can be replaced by any positive constant within (0, 1/2). 










When the number m of measurements is proportional to n and the energy of the planted solution exceeds 
log 3 m, Theorem [ 3 ] asserts that there exists absolutely no estimator that can achieve an estimation error 
that vanishes as n increases. This lower limit matches the estimation error of TWF, which corroborates the 
optimality of TWF under noisy data. 

Recall that in many optical imaging applications, the output data we collect are the intensities of the 
diffractive waves scattered by the sample or specimen under study. The Poisson noise model employs the 
input x and output y to describe the numbers of photons diffracted by the specimen and detected by the 
optical sensor, respectively. Each specimen needs to be sufficiently illuminated in order for the receiver to 
sense the diffracted light. In such settings, the low-intensity regime ||a:|| < log 1 ' 5 m is of little practical 
interest as it corresponds to an illumination with just very few photons. We forego the details. 

It is worth noting that apart from WF, various other nonconvex procedures have been proposed as 
well for phase retrieval, including the error reduction schemes dating back to Gerchberg-Saxton and Fienup 


iterated projections [22] , alternating minimization [21], generalized approximate message passing [23], 
Kaczmarz method [35 , and greedy methods that exploit additional sparsity constraint [27 , to name just a 
few. While these paradigms enjoy favorable empirical behavior, most of them fall short of theoretical support, 
except for a version of alternating minimization (called AltMinPhase) [2l] that requires fresh samples for 
each iteration. In comparison, AltMinPhase attains e-accuracy when the sample complexity exceeds the 
order of nlog 3 n + ?rlog 2 nlog(l/e), which is at least a factor of log 3 n from optimal and is empirically 
largely outperformed by the variant that reuses all samples. In contrast, our algorithm uses the same set of 
samples all the time and is therefore practically appealing. Furthermore, none of these algorithms come with 
provable stability guarantees, which are particularly important in most realistic scenarios. Numerically, each 
iteration of Fienup’s algorithm (or alternating minimization) involves solving a least squares problem, and 
the algorithm converges in tens or hundreds of iterations. This is computationally more expensive than TWF, 
whose computational complexity is merely about 4 times that of solving a least squares problem. Interesting 


readers are referred to jT] for a comparison of several non-convex schemes, and 
alternative approaches (e.g. 36 37]) and performance lower bounds (e.g. (38 39 


33 for a discussion of other 


)• 


2 Algorithm: Truncated Wirtinger Flow 


This section describes the two stages of TWF in details, presented in a reverse order. For each stage, we 
start with some algorithmic issues encountered by WF, which is then used to motivate and explain the basic 
principles of TWF. Here and throughout, we let A : M nX ™ 1 —> R m be the linear map 

M e r x ” ^ A(M):={ajMa i } 1 ^ m 


and A the design matrix 


A := [ai, • • ■ 


2.1 Regularized gradient stage 

For independent samples, the gradient of the real-valued Poisson log-likelihood obeys 


E W ^) = 

i=1 



:=Vi 


( 21 ) 


where 1 \ represents the weight assigned to each Gq. This forms the descent direction of WF updates. 

Unfortunately, WF moving along the preceding direction might not come close to the truth unless z is 
already very close to cc. To see this, it is helpful to consider any fixed vector z € R" independent of the design 
vectors. The typical size of mini<j< m la^zl about on the order of — 11 z: 11, introducing some unreasonably 
large weights z/j, which can be as large as m||a:|| 2 /||z||. Consequently, the iterative updates based on (1.21 
often overshoot, and this arises starting from the very initial 


6 For complex-valued data where a, ~ J\f ( 0 , 1) + j]\f ( 0 , 1). WF converges empirically, as min; |a*z| is much larger than the 
real-valued case. 
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V n '' x x ' 
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\ \ >%> 






\ a T z \ 2 — \a T x \ 2 

Figure 4: The locus of — \ T \!li (z) = — ' y — a, when a* ranges over all unit vectors, where a: = (2.7, 8) 

and z = (3,6). For each direction a,;, — ^V£i(z) is aligned with Gq, and its length represents the weight 
assigned to this component. In particular, the red arrows depict a few directions that behave like outliers, 
whereas the blue arrows depict several directions whose resulting gradients take typical sizes. 


Fig. El illustrates this phenomenon by showing the locus of —V£» (z) when a; has unit norm and ranges 
over all possible directions. Examination of the figure seems to suggest that most of the gradient components 
'sJii (z) are more or less pointing towards the truth x and forming reasonable search directions. But there 
exist a few outlier components that are excessively large, which lead to unstable search directions. Notably, 
an underlying premise for a nonconvex procedure to succeed is to ensure all iterates reside within a basin 
of attraction, that is, a neighborhood surrounding x within which x is the unique stationary point of the 
objective. When a gradient is not well-controlled, the iterative procedure might overshoot and end up leaving 
this basin of attraction. This intuition is corroborated by numerical experiments under real-valued data. As 
illustrated in Fig. [5] the solutions returned by the WF (designed for a real-valued Poisson log-likelihood and 
m = 8 n) are very far from the ground truth. 

Hence, to remedy the aforementioned stability issue, it would be natural to separate the small fraction of 
abnormal gradient components by regularizing the weights i/*, possibly via data-dependent trimming rules. 
This gives rise to the update rule of TWF: 


z (t+ i) = *<*) + —Vetriz®), VieN, 

m 

where W tr (•) denotes the regularized gradient given bsQ 


ve„(z) :=J2 2 

1=1 


Vi — \ a i z \ 2 


a il£j(z)n£j(z)' 


( 22 ) 


(23) 


for some trimming criteria specified by £\ (•) and £\ (•). In our algorithm, we take £\ (z) and £\ (z) to be 
two collections of events given by 


£{{z ) := ja4 b < < a£ b j, 

£ 2 ( 2 ) : = z\ 2 \<^-\\y-A{zz r )\\ l 


(24) 

(25) 


where al b , a“ b , a z are predetermined thresholds. To keep notation light, we shall use £\ and £\ rather than 
£\ (z) and £\ (z) whenever it is clear from context. 


7 


TZ 


In the complex-valued case, 

? y i -\z*a i \ 2 

aiA q( z )nq(z)- 


the trimming rule is enforced upon the Wirtinger derivative, which reads V^tr (z) ■= 
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Figure 5: Relative MSE vs. SNR in dB. The curves are shown for WF and TWF, both employing the 
Poisson log-likelihood. Here, a; ~ Af(0,I), n = 1000, m = 8 n, and each point is averaged over 100 Monte 
Carlo trials. 


We emphasize that the above trimming procedure simply throws away those components whose weights 
z/j’s fall outside some confidence range, so as to remove the influence of outlier components. To achieve this, 
we regularize both the numerator and denominator of z^ by enforcing separate trimming rules. Recognize 
that for any fixed z , the denominator obeys 


E [|a7z|] = \/2jn\\z\\, 


leading up to the rule (241. Regarding the numerator, by the law of large numbers one would expect 

E [| - \ajz\ 2 \\ « — \\y - A (zz T ) || 
li u m n L 


and hence it is natural to regularize the numerator by ensuring 


\vi ~ \ a J z \ 2 \ 


< — II y — A (zz 1 

TO 


i i ' 


As a remark, we include an extra term | aj z|/||z|| in (25) to sharpen the theory, but all our results continue to 
hold (up to some modification of constants) if we drop this term in (251. Detailed procedures are summarized 
in Algorithm [D 

The proposed paradigm could be counter-intuitive at first glance, since one might expect the larger 
terms to be better aligned with the desired search direction. The issue, however, is that the large terms are 
extremely volatile and could have too high of a leverage on the descent directions. In contrast, TWF discards 
these high-leverage data, which slightly increases the bias but remarkably reduces the variance of the descent 
direction. We expect such gradient regularization and variance reduction schemes to be beneficial for solving 
a broad family of nonconvex problems. 


2.2 Truncated spectral initialization 


In order for the gradient stage to converge rapidly, we need to seed it with a suitable initialization, 
natural alternative is the spectral method adopted in |[T 21 


One 

which amounts to computing the leading 


8 Careful readers might note that we include some extra factor jj- (which is approximately 1 in the Gaussian model) in 
Algorithm |T| This occurs since we present Algorithm [l| in a more general fashion that applies beyond the model ai ~ A/*(0, /), 
but all results / proofs continue to hold in the presence of this extra term. 
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Algorithm 1 Truncated Wirtinger Flow. 

Input: Measurements {y, | 1 < i < to} and sampling vectors {a.; | 1 < i < to}; trimming thresholds al b , 
a“ b , ah, and a y (see default values in Table [lj|. 


Initialize z^ to be 


l/ E^TlKlP ^’ where = \Jm l Vi and 5 is tlie leading eigenvector of 

^ m 


(26) 


Loop: for t = 0 : T do 


(*)| 




TO ^ 

*=1 * 


where 


£(:=<! a lb < 


a. 


—< a u 

r(t)|| ~ a * (’ 


nb 


^2 : = i \Vi ~ \ a i z(t) \ 2 \ < a h K t - 


/n \a~z 


(01 


a,; 


sWl 


and K t := — Y] \y t - |o* 

m z ' 1 


*z w | 2 | 


Z=1 


Output zt- 


(27) 

(28) 


eigenvector of Y := A X)!=i yi a i a J■ This arises from the observation that when a.; ~ A/” (0, 1) and ||x|| = 1, 

E [Y] =I + 2xx r , 


whose leading eigenvector is exactly x with an eigenvalue of 3. 

Unfortunately, this spectral technique converges to a good initial point only when to > nlogn, due to 
the fact that ( ajx) 2 aiaj is heavy-tailed, a random quantity which does not have a moment generating 
function. To be more precise, consider the noiseless case yt = \aj x\ 2 and recall that max.; j/j ss 2 log to. 
Letting k = argmax;y;, one can calculate 


\ T 

\ -yr 

Ri iKli 


> 



2 n log to 

TO 


which is much larger than x T Yx = 3 unless m/n is very large. This tells us that in the regime where 
to. x n, there exists some unit vector afc/||ofe|| that is closer to the leading eigenvector of Y than x. This 
phenomenon happens because the summands of Y have huge tails so that even one large term could end 
up dominating the empirical sum, thus preventing the spectral method from returning a meaningful initial 
guess. 

To address this issue, we propose a more robust version of spectral method, which discards those obser¬ 
vations y.i that are several times larger than the mean during spectral initialization. Specifically, the initial 
estimate is obtained by computing the leading eigenvector z of the truncated sum 


Y := 


1 

TO 


m 


yi a i a i 1 {\ Vi \<al(± EUi 2/i)} 


(29) 


for some predetermined threshold a y , and then rescaling z so as to have roughly the same norm as x (which 
is estimated to be Y^iLi Jft)> see Algorithm [ll for the detailed procedure. 

Notably, the aforementioned drawback of tire spectral method is not merely a theoretical concern but 
rather a substantial practical issue. We have seen this in Fig. [2] (main quad example) showing the enormous 
advantage of truncated spectral initialization. This is also further illustrated in Fig. [6] which compares the 
empirical efficiency of both methods with a y = 3 set to be the truncation threshold. For both Gaussian 
designs and CDP models, the empirical loss incurred by the original spectral method increases as n grows, 


12 
















Table 1: Range of algorithmic parameters 


(a) When a fixed step size /i t = fi is employed: (a lb , a“ b ,ah,a y ) obeys 


Ci := max | 


E 


/l.Ola^ or |£|>\ 


C 2 E [C 2 l{|4|>0.473a fc }] 5 

2(Ci + < 2 ) + < 1-99, 


P(|C! < v / U0Ta4 b or Id > A/0J9a" b )} 


„ Oy Cl 3; 

where £ ~ A/"(0,1). By default, al. b = 0.3, aP b = ah = 5, and a y = 3. 

(b) When fj, t is chosen by a backtracking line search: (o4 b , a“ b , ah, a y , a p ) obeys 
0 < q! 1 < 0.1, a^ h >5, ah > 6, a y > 3, and a p > 5. 


By default, al, b = 0.1, a“ b = 5 , ah = 6, a y = 3, and a p = 5. 


(30) 


(31) 


which is in stark constrast to the truncated spectral method that achieves almost identical accuracy over 
the same range of n. 




(b) 


Figure 6: The empirical relative error for both the spectral and the truncated spectral methods. The 
results are averaged over 50 Monte Carlo runs, and are shown for: (a) 1-D Gaussian measurement where 
a,i ~ and to = 6n; (b) 2-D CDP model ^ where the diagonal entries of D are uniformly drawn 

from {1, — 1, j, —j}, n = ni x n-i with ni = 300 and 712 ranging from 64 to 1280, and m = 12n. 


2.3 Choice of algorithmic parameters 

One implementation detail to specify is the step size /r t at each iteration t. There are two alternatives that 
work well in both theory and practice: 

1. Fixed step size. Take ji t = /i (Vt G N) for some constant fj, > 0. As long as /r is not too large, 
our main results state that this strategy always works—although the convergence rate depends on /i. 
Under appropriate conditions, our theorems hold for any constant 0 < fi < 0.28. 
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2. Backtracking line search with truncated objective. This strategy performs a line search along 
the descent direction 

Pt ■ = —Vt'tr(z t ) 

TO 

and determines an appropriate step size that guarantees a sufficient improvement. In contrast to the 
conventional search strategy that determines the sufficient progress with respect to the true objective 
function, we propose to evaluate instead a regularized version of the objective function. Specifically, 
put 

Kz) ■= i yil °S(\ a Jz\ 2 ) - \a,Jz\ 2 } , (32) 

ieT(z) 

where 

T(z) := {* | \ajz\ > o4 b ||z|| and \ajp\ < a p ||p||} . 

Then the backtracking line search proceeds as 


(a) Start with r = 1; 

(b) Repeat r <— (3t until 

-F(z (t) + rp (t) ) > -?(z (t) ) + It\\ P w f, (33) 

TO TO 2 " " 

where /3 € (0,1) is some pre-determined constant; 

(c) Set p t = t. 


By definition (321, evaluating £(z^ + rp W) mainly consists in calculating the matrix-vector product 
A(z W + rpW). In total, we are going to evaluate ^(z*^ + rp^) for 0(log(l//3)) different r’s, and 
hence the total cost amounts to computing Az^\ Ap as well as 0(TOlog(l//3)) additional flops. 
Note that the matrix-vector products Az^ and Ap^ need to be computed even when one adopts 
a pre-determined step size. Hence, the extra cost incurred by a backtracking line search, which is 
0(m log(l//3)) flops, is negligible compared to that of computing the gradient even once. 


Ub 


.,ub 


Another set of important algorithmic parameters to determine is the trimming thresholds ah , a‘ z “, a“ 
and a p (for a backtracking line search only). The present paper isolates the set of (ah, o4 b , a“ b , a y ) 
obeying (30) as given in Table [l] when a fixed step size is employed. More concretely, this range subsumes 
as special cases all parameters obeying the following constraints: 


x v i 


0 < a lb < 0.5, 


„ub 


>5, ah > 5, and a y > 3. 


(34) 


When a backtracking line search is adopted, an extra parameter a p is needed, which we take to be a p > 5. 
In all theory presented herein, we assume that the parameters fall within the range singled out in Table [I] 


3 Why TWF works? 


Before proceeding, it is best to develop an intuitive understanding of the TWF iterations. We start with a 
notation representing the (unrecoverable) global phase |T] for real-valued data 


0 , if ||z — £c|| < ||z + *|| , 
7 r, else. 


It is self-evident that 


<i>(z) ■- 

(-z) + — Vtrf'( - z ) = - | z + — Vtrf(z)| , 


and hence (cf. Definition ([8])) 

dist f(-z) + — V tr i(—z),x] = dist (z + — X7 tI £ ( z ), a:) 
V TO / V TO / 


( 35 ) 
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despite the global phase uncertainty. For simplicity of presentation, we shall drop the phase term by letting 
z be z and setting h = z — x, whenever it is clear from context. 

The first object to consider is the descent direction. To this end, we find it convenient to work with 
a fixed z independent of the design vectors a.j, which is of course heuristic but helpful in developing some 
intuition. Rewrite 


W i(z) = 


n (aJxf-(aJ Z )\ 

Z -r- do 


(0 n ( a Jh)(2ajz - ajh) ~ 

- Z -r- do 


= -4 (aj h)a,i + 2 


aj z 


do 


(36) 


where (i) follows from the identity a 2 — b 2 = (a + b) (a — b). The first component of (36), which on average 
gives —Ah, makes a good search direction when averaged over all the observations i = 1,..., m. The issue is 
that the other term r,—which is in general non-integrable—could be devastating. The reason is that aj z 
could be arbitrarily small, thus resulting in an unbounded r,. As a consequence, a non-negligible portion of 
the rq’s may exert a very strong influence on the descent direction in an undesired manner. 

Such an issue can be prevented if one can detect and separate those gradient components bearing abnormal 


tVs. Since we cannot observe the individual components of the decomposition (361, we cannot reject indices 
with large values of r,; directly. Instead, we examine each gradient component as a whole and discard it 
if its size is not absolutely controlled. Fortunately, such a strategy is sufficient to ensure that most of the 
contribution from the regularized gradient comes from the first component of (36), namely, —A(aJh)a.i. As 


will be made precise in Proposition [2] and Lemma [7] the regularized gradient obeys 


ivf.-W.ft) a ( 4 -*)INI a -o(MI) 


and 


-w te (z) 

TO 


< llhll. 


(37) 

(38) 


Here, one has (4 — e)||/i|| 2 in (37) instead of 4||fa|| 2 to account for the bias introduced by adaptive trimming, 


where e is small as long as we only throw away a small fraction of data. Looking at (37) and (381 we see that 


the search direction is sufficiently aligned with the deviation —h = x — z of the current iterate; i.e. they 
form a reasonably good angle that is bounded away from 90°. Consequently, 2 is expected to be dragged 
towards x provided that the step size is appropriately chosen. 

The observations (371 and (38) are reminiscent of a (local) regularity condition given in [l], which is a 
fundamental criterion that dictates rapid convergence of iterative procedures (including WF and other gra¬ 
dient descent schemes). When specialized to TWF, we say that -Wftr (•) satisfies the regularity condition, 
denoted by RC (//, A, e), if 


(h,--Vl tI (z)\ > £ -VMz) 

\ m / z m 




(39) 


holds for all ,z obeying ||z — x\\ < e||a:||, where 0 < e < 1 is some constant. Such an e-ball around x forms a 
basin of attraction. Formally, under RC (/n, A, e), a little algebra gives 


dist 2 (z + — Vi'tr (z), x) < 
V to / 


2 + — V£ tr (z) - x 

TO 


= iifcir + 


-V4r (Z) 


• 2/x (h, — tr ( z ) 

\ m 


< \\h\\ 2 + ^W t r (Z) 

m 

= (1 — fiX) dist 2 (z, x) 


—V£ tr (z) 
m 


-V\\\h\\ 


(40) 


for any z with \\z — x\\ < e. In words, the TWF update rule is locally contractive around the planted 
solution, provided that RC (//, A, e) holds for some nonzero /r and A. Apparently, Conditions (37) and (381 
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Figure 7: A function —£{z) satisfying RC: —£(z) = z 2 for any z £ [—6,6], and —£{z) = z 2 + 

1.5|z| (cos(|z| — 6) — 1) otherwise. 

already imply the validity of RC for some constants fi, A 1 when ||/i||/||z|| is reasonably small, which in 
turn allows us to take a constant step size /i and enables a constant contraction rate 1 — /iA. 

Finally, caution must be exercised when connecting RC with strong convexity, since the former does not 
necessarily guarantee the latter within the basin of attraction. As an illustration, Fig. [7] plots the graph of a 
non-convex function obeying RC. The distinction stems from the fact that RC is stated only for those pairs 
z and h = z — x with x being a fixed component, rather than simultaneously accommodating all possible z 
and h = z — z with z being an arbitrary vector. In contrast, RC says that the only stationary point of the 
truncated objective in a neighborhood of x is x , which often suffices for a gradient-descent type scheme to 
succeed. 


4 Numerical experiments 


In this section, we report additional numerical results to verify the practical applicability of TWF. In all 
numerical experiments conducted in the current paper, we set 


a 1 , 15 = 0.3, 


nb 


= 5, ah = 5, and a y = 3. 


(41) 


This is a concrete combination of parameters satisfying our condition (30). Unless otherwise noted, we 


employ 50 power iterations for initialization, adopt a fixed step size /it = 0.2 when updating TWF iterates, 
and set the maximum number of iterations to be T = 1000 for the iterative refinement stage. 

The first series of experiments concerns exact recovery from noise-free data. Set n = 1000 and generate 
a real-valued signal x at random. Then for to varying between 2 n and 6 n, generate to design vectors Gq 
independently drawn from Af (0, 1). An experiment is claimed to succeed if the returned estimate x satisfies 
dist (x,x) / ||x|| < 10 -5 . Fig. [8] illustrates the empirical success rate of TWF (over 100 Monte Carlo trials 
for each to) revealing that exact recovery is practially guaranteed from fewer than 1000 iterations when the 
number of quadratic constraints is about 5 times the ambient dimension. 

To see how special the real-valued Gaussian designs are to our theoretical finding, we perform experiments 
on two other types of measurement models. In the first, TWF is applied to complex-valued data by generating 
cq ~ Af (0,1/) + jAf (0, | J). The other is the model of coded diffraction patterns described in (19]). Fig. hH 
depicts the average success rate for both types of measurements over 100 Monte Carlo trials, indicating that 
to > 4.5n and to > 6n are often sufficient under complex-valued Gaussian and CDP models, respectively. 

For the sake of comparison, we also report the empirical performance of WF in all the above settings, 
where the step size is set to be the default choice of [l], that is, fi t = minjl — e - U 330 ,0.2}. As can be seen, 
the empirical success rates of TWF outperform WF when T = 1000 under Gaussian models, suggesting that 
TWF either converges faster or exhibits better phase transition behavior. 

Another series of experiments has been carried out to demonstrate the stability of TWF when the number 
to of quadratic equations varies. We consider the case where n = 1000, and vary the SNR (cf. ©) from 15 
dB to 55dB. The design vectors are real-valued independent Gaussian ~ AF ( 0, /), while the measurements 
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Figure 8: Empirical success rate under real-valued Gaussian sampling a.; 


Af(0 ,I n ). 


yi are generated according to the Poisson noise model Q. Fig. 10 shows the relative mean square error— 
in the dB scale—as a function of SNR, when averaged over 100 independent runs. For all choices of to, 
the numerical experiments demonstrate that the relative MSE scales inversely proportional to SNR, which 
matches our stability guarantees in Theorem [2] (since we observe that on the dB scale, the slope is about -1 
as predicted by the theory ©)• 


5 Exact recovery from noiseless data 

This section proves the theoretical guarantees of TWF in the absence of noise (i.e. Theorem[l]|. We separate 
the noiseless case mainly out of pedagogical reasons, as most of the steps carry over to the noisy case with 
slight modification. 

The analysis for the truncated spectral method relies on the celebrated Davis-Kahan sinO theorem |56j, 
which we defer to Appendix [c] In short, for any fixed <5 > 0 and x £ R n , the initial point returned by 
the truncated spectral method obeys 

dist(;s/ 0) , x) < <5||a;| 

with high probability, provided that m/n exceeds some numerical constant. With this in place, it suffices to 
demonstrate that the TWF update rule is locally contractive, as stated in the following proposition. 

Proposition 1 (Local error contraction). Consider the noiseless case 0- Under the condition 
there exist some universal constants 0 < po < 1 and Co,Ci,C 2 > 0 such that with probability exceeding 
1 — Ci exp (—C 2 to), 


dist 2 (z + — V4r (z) , a:) < (1 — po) dist 2 (z, x) 

V to / 

holds simultaneously for all x, z £ R n obeying 


dist (z, x) 


1 al b a? 5.7 (4 b ) 5 


< min < —, - 5 -, —r- 


||z|| I 11 ’ 3a^’ 6 ’ 2a“ b + a 

provided that m > cgn and that y, is some constant obeying 


lb 


O . .. / .. °- 994 - Ci - Ca - VVW)u h 

2(1.02 + 0.665/a^) 


-l 


(42) 


(43) 
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(a) (b) 

Figure 9: Empirical success rate for exact recovery using TWF. The results are shown for (a) complex¬ 
valued Gaussian sampling a, ~ Af{0, \l n ) + j"7V(0, \l n ), and (b) CDP with masks uniformly drawn from 
{ 1 ,- 1 , 3 , -]}■ 


Proposition [T] reveals the monotonicity of the estimation error: once entering a neighborhood around x 
of a reasonably small size, the iterative updates will remain within this neighborhood all the time and be 
attracted towards cc at a geometric rate. 

As shown in Section [3] under the hypothesis RC (/x, A, e) one can conclude 

dist 2 (z + — V£ t r( 2: ), x) < (1 — /xA)dist 2 (,z, x), V(z, x) with dist(z, x) < e. (44) 

Thus, everything now boils down to showing RC (/x, A, e) for some constants /x, A, e > 0. This occupies the 
rest of this section. 


5.1 Preliminary facts about {£{] and {£ 3 } 

Before proceeding, we gather a few properties of the events £[ and £ 3 , which will prove crucial in establishing 
RC(/x, A,e). To begin with, recall that the truncation level given in depends on A ||A(a;a; T — zz T )|| . 
Instead of working with this random variable directly, we use deterministic quantities that are more amenable 
to analysis. Specifically, we claim that A ||_T ( xx T — zz T ) || offers a uniform and orderwise tight estimate 
on ||/i|| ||z||, which can be seen from the following two facts. 

Lemma 1 . Fix ( € (0,1). If m > conf~ 2 log then with probability at least 1 — C exp(—ci£ 2 m), 

0.9 (1 - C) \\M\\ f < i \\A (M)!!, < (1 + 0 V2 \\M\\ f (45) 


holds for all symmetric rank-2 matrices M £ Here, c$,C\,C > 0 are some universal constants. 

Proof. Since ( 6 , Lemma 3.1] already establishes the upper bound, it suffices to prove the lower tail bound. 
Consider all symmetric rank-2 matrices M with eigenvalues 1 and —t for some — 1 < t < 1. When t £ [0,1], 
it has been shown in the proof of [6, Lemma 3.2] that with high probability, 

— \\A(M)\\i > (1 — C ) / (*), ( 46 ) 

m 


for all such rank-2 matrices M, where / (t) := {2 y/t + (1 — t) (tt/2 — 2arctan(v / t)) }• 

in this case can then be justified by recognizing that / (f) /y/l + t 2 > 0.9 for all t £ [0,1] 
Fig. 11 The case where t £ [—1, 0] is an immediate consequence from |6 ; Lemma 3.1]. 


The lower bound 
, as illustrated in 
□ 
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Figure 10: Relative MSE vs. SNR when the yf s follow the Poisson model. 



Figure 11: as a f unc ti° n of t. 


Lemma 2. Consider any x, z £ R" obeying \\z — x|| <5 ||z|| for some 5 < \. Then one has 

V2 — 4S \\z — x\\ \\z\\ < \\xx T — zz T || p < (2 + 5) ||z — x\\ ||z||. (47) 

Proof. Take ft = z — x and write 

11cca3 T — zz T ||p = \\ — hz T — zh T + hh T \\t 

II Hr II Hr 

= ||ftz T + zft T || p + ||ft|| 4 — 2 (hz T + zh T , ftft T ) 

= 2 ||z|| 2 ||ft|| 2 + 2|ft T z| 2 + ||ft|| 4 - 2||ft|| 2 (ft T z + z T h). 

When 11ft,11 < |||z||, the Cauchy-Schwarz inequality gives 


2||z|| 2 ||ft|| 2 -4||z||||ft|| 3 <|| a: x T -zz T || F <4||z|| 2 ||ft|| 2 +4||ft|| 3 ||z|| + ||ft|| 4 , (48) 

=> V (2 I|z|| - 4 ||ft||) ||z|| • ||ft|| < ||** T - zz T |j F < (2 ||z|| + ||ft||) • ||ft|| (49) 

as claimed. □ 

Taken together the above two facts demonstrate that with probability 1 — exp (—12 (to)), 

1-1511 z - *|| \\z\\ < ^ \\A(xx T - zz T )|| 1 < 3||z- *|| ||z|| (50) 
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holds simultaneously for all z and x satisfying ||h|| < ^ ||z||. Conditional on (50), the inclusion 


C £* 

C £\ 

(51) 

a J z I 2 

< 1.15 a h ||h|| • \ajz }, 

(52) 

a J z \ 2 \ 

< 3 a h ||h|| ■ \ajz } . 

(53) 


holds with respect to the following events 

£3 ■ = {I \ a l x \ 2 

ci . _ ri|^T_|2 


The point of introducing these new events is that the fg’s (resp. £|’s) are statistically independent for any 
fixed x and z and are, therefore, easier to work with. 

Note that each £\ (resp. £|) is specified by a quadratic inequality. A closer inspection reveals that in 
order to satisfy these quadratic inequalities, the quantity ajh must fall within two intervals centered around 
0 and 2 aj z, respectively. One can thus facilitate analysis by decoupling each quadratic inequality of interest 
into two simple linear inequalities, as stated in the following lemma. 


Lemma 3. For any 7 > 0, define 


VI 




V. 


and Vi 


i, 1 


\ a J X \ 2 - \ a J Z \ 2 \ 

,T| 


< 711^11 \ a J z \} 


K h \ 

llhll 


< 7 


aj h 2 aj z 


M 


\h\\ 


< 7 


(54) 

(55) 

(56) 


Thus, Vij 1 and Vf 2 represent the two intervals on ajh centered around 0 and 2 ajz. If then the 

following inclusion holds 


V 1 


1 +V 2 


v^nsi c (vi ; 1 n £[) u (vi ^ 2 n £{). 

5.2 Proof of the regularity condition 


(57) 


By definition, one step towards proving the regularity condition (39 ) is to control the norm of the regularized 
gradient. In fact, a crude argument already reveals that ||—Wtr(-z)|| < \\h\\. To see this, introduce v = 


Ni<i<m with v i := 2 
property (511 that 


j, T zl > 


and 


1 £jn£* ■ It comes from the trimming rule £\ as well as the inclusion 

; INI INI, 


Vi - Wi z 


1 


< — j| A(xx 1 - .zz 
m 


T \ 


implying \vi\ < \\h\\ and hence \\v\\ < v /m||/i||. The Marchenko-Pastur law gives ||A|| < ywhence 

Ml<l|fc||- ( 58 ) 


l||VM*)ll = -hl^ll< 1 MII 

m mm 


A more refined estimate will be provided in Lemma [7j 

The above argument essentially tells us that to establish RC, it suffices to verify a uniform lower bound 
of the form 

~(h, ^V£ tr (z)} > \\h \\ 2 , (59) 

as formally derived in the following proposition. 

Proposition 2. Consider the noise-free measurements yi = la^ajj 2 and any fixed constant e > 0. Under 


the condition (30), if m > C\n, then with probability exceeding 1 — C'exp(— Com), 
-(h,^VM*)) > 2 {l.99 - 2 (Ci + C 2 ) - ^8/(9^)^ 1 - e} 


h II 


(60) 
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holds uniformly over all x, z £ 



ai b al b 5.7 (al b )" 


1 


11’ 3^’ 6 ’ 2a" b + « 


„lb 


(61) 


Here, Cq,C\,C > 0 are some universal constants, and (i and f 2 ure defined in (| 3C 

The basic starting point is the observation that (aj z) — (ajx) 2 = ( ajh)(2aj z — aj ft) and hence 

-Tvm„) - 

/m m J n ' -r 




i=l 


m 1 m 

= — E 2 ( a ^ /l ) a * 1 £;n£i- E 

m 1 ^ m ^^ 


(a^) 2 


i=l 


ma 7 z 

1=1 1 


a il£‘n£‘- 


(62) 


One would expect the contribution of the second term of (62) (which is a second-order quantity) to be small 
as \\h\\ / ||z|| decreases. 

To facilitate analysis, we rewrite (621 in terms of the more convenient events Vf 1 and T>5j 2 . Specifically, 
the inclusion property (511 together with Lemma [3] reveals that 

24 1 n£{ C nfj c nfj c £• nf{ c (^uD‘f) n£{. 


where the parameters 73,74 are given by 

73 := 0A76a h , and 74 := 3a h - 


(63) 

(64) 


This taken collectively with the identity (621 leads to a lower estimate 


-<- v 4,(zU)> 


E ( a J h ) 1 einvi,f E 


la- fe 


i=i 


i=l 


s[rv^l 


--E 


a. ; ft. 


i=l 


L £;n-p;- 4 2 ’ 


(65) 


leaving us with three quantities in the right-hand side to deal with. We pause here to explain and compare 
the influences of these three terms. 

To begin with, as long as the trimming step does not discard too many data, the first term should be 
close to 2 , \ajh\ 2 , which approximately gives 2||ft|| 2 from the law of large numbers. This term turns 

out to be dominant in the right-hand side of ( |65| ) as long as ||ft||/||z|| is reasonably small. To see this, 
please recognize that the second term in the right-hand side is 0(||ft|| 3 /||z||), simply because both aj ft and 
ajz are absolutely controlled on 'Dfd n £\. However, Vf 2 does not share such a desired feature. By the 
very definition of T>f 2 , each nonzero summand of the last term of (651 must obey |a^~ft| ss 2 Ja^z] an dj 

I T *l 3 

therefore, 1 \ | lgi nv*' 2 roughly of the order of ||z|| 2 ; this could be much larger than our target level 

||ft|| 2 . Fortunately, F> 1 ' 2 is a rare event, thus precluding a noticable influence upon the descent direction. 
All of this is made rigorous in Lemma [4] (first term), Lemma [ 5 ] (second term) and Lemma [6] (third term) 
together with subsequent analysis. 


Lemma 4. Fix 7 > 0, and let £{ and Vfj 1 be defined in (2f) and (55), respectively. Set 


Cl := 1 — min 

ti 


{e c 2 


L { Vl-01a lb < 1 5 1 < v / 0i99a } 


,E 


L {v / TioTa 1 a! b <|5|<v , 0 


.99aJ b }] } 


( 66 ) 

and (2 ’■= IE £ ^{|{|>v'o. 997 } > 
where £ ~ Af (0, 1). For any e > 0, if m > C\ne loge - , then with probability at least 1 — Cexp(— Coe 2 m), 

m 

-El^rW - (Wi^-eHftf (68) 


holds for all non-zero vectors h,z £ R n . Here, cq,c\,C > 0 are some universal constants. 
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We now move on to the second term in the right-hand side of (651. For any fixed 7 > 0, the definition of 
E\ gives rise to an upper estimate 


1 171 I T u 

1 v— > a.j h 

E 


m 


ajz\ £ l nv -i a lb \\z\\ m 


1 1 V'' | TjJ 3 -, ^ (1 + £ ) \/ 8/7T ||ft.|| 

E l° ih l V’ 1 - - oTO- 


(69) 


where a/ 8 / 7 t ||h || 3 is exactly the untruncated moment E[|a^~h| 3 ]. The second inequality is a consequence of 
the lemma below, which arises by observing that the summands \ajh\ 3 l v i,i are independent sub-Gaussian 
random variables. 

Lemma 5. For any constant 7 > 0, if m/n > cq • cT 2 loge -1 , then 


^2 ", h 1 dL' 1 ^ (! + e) ll^ll 3 j V/ie 


(70) 


i=l 


with probability at least 1 — C exp(— c\e 2 m) for some universal constants cq, c±, C > 0. 


It remains to control the last term of (651. As mentioned above, the influence of this term is small since 
the set of af s satisfying V 1 ’ 2 accounts for a small fraction of measurements. Put formally, the number of 
equations satisfying (a/" h | > 7 ||h|| decays rapidly for large 7 (at least at a quadratic rate), as stated below. 

Lemma 6 . For any 0 < e < 1, there exist some universal constants cq,Ci,C > 0 such that 


1 m 1 

- E 1 {|aT h |> 7 ||fc||} ^ ex P ( 0.485 7 2 ) +± Vh e R n \{ 0 } and 7 > 2 

i—1 ’ ' 

with probability at least 1 — C exp (— coe 2 rn). This holds with the proviso m/n > ci • e - 2 loge -1 . 


To connect this lemma with the last term of (651, we recognize that when 7 < , one has 

1 £{DD\' 2 ^ 1 {|o i T fc|>a»>||z||}- 


(71) 


(72) 


The constraint 


i • h 2a^ ; 

rar _ 


< 7 of Vf 2 necessarily requires 


\aj h I > 2 \aj z 


2a lb ||z|| 

- 7 > f 1 " - 7 > 


oi v ?\\z\ 


(73) 


.M ~ ||h|| M IN ’ 

where the last inequality comes from our assumption on 7 . With Lemma [ 6 ] in place, (721 immediately gives 


E 1 £i 


i= 1 


inx>fy - 


< 


0.49a lb IIz 


1 


9800 V oi l ? z 


exp -0.485 


+ 


K b ) 


4 b N 

IN 

N 

fyl 


ellfcll 


N b ) 2 NI 2 


(74) 


as long as -pjj < where the last inequality uses the majorization 2 oooo:r 4 — x ex P ( — 0.485x 2 ) holding for 
any x > 6 . 

In addition, on £[ n , the amplitude of each summand can be bounded in such a way that 


1 Tj,l 3 

\ a hi 

1*1 

2 ajz\ + 7 \\h\\ 

(2af ||z||+7 ||^||) 2 

(75) 

\ a J z \ 

\ a J z \ 

< 1 

l 4 b Ni; 

( 2 “" b + 1 H) 2|bf ' 

(76) 
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where both inequalities are immediate consequences from the definitions of Vf 2 and S\ (see (56) and (24)). 
Taking this together with the cardinality bound (741 and picking e appropriately, we get 


m I t 


E 

i=l 


a■ h\ 




jlMJ 

rylb llzl 


I s ) ( 2 a ub ,,, I, 

A * + 7 h*ii; \M 


9800 (aS 3 ) 


Furthermore, under the condition that 


■Si 


\z\\ , ||h|| . a/ 98 (a 


lb\ 


7 -“'M and ||z|| - ^(2a“b +a ib)’ 


one can simplify (771 by observing that t?i < which results in 


M 


(77) 


m | y 


-E 

to t' 


a, h 




< 


1 

Too 


Ifel 


(78) 


Putting all preceding results in this subsection together reveals that with probability exceeding 1 — 
exp (—f 2 (to)), 


A A*' W ) 2 {l-99-2(( I +< 2 )- v ^J) j L-3 e } ll h|r 


> 


{1.99 - 2 (Ci + C 2 ) - \78M3aA 1 - 3e} 


holds simultaneously over all x and z satisfying 


\h\\ 


< min ■ 


a? A AW3 (A b ) 2 J_' 
3a h ’ 6 ’ 2a“ b + a lb ’ 11 


(79) 


(80) 


as claimed in Proposition [2] 

To conclude this section, we provide a tighter estimate about the norm of the regularized gradient. 

Lemma 7. Fix 5 > 0, and assume that yi = ( ajx) 2 . Suppose that m > con for some large constant cq > 0. 
There exist some universal constants c, C > 0 such that with probability at least 1 — C exp (—cm), 


-||VM*)1 < (1 + <J) -4V1-02 + 0.665/a h ||h|| 


m 


(81) 


holds simultaneously for all x, z £ R n satisfying jj^jj < min | A+ A’ 


( b a lb y / 98/3(al b ) 2 1 


2 a^+CKy 


• A}- 


Lemma [7] complements the preceding arguments by allowing us to identify a concrete plausible range for 
the step size. Specifically, putting Lemma [7] and Proposition [2] together suggests that 


~(h,-V£ tl (z)) > 
\ m / 


:{l-99 - 2 (Cl + C 2 ) - V*W*)<*h 1 - e } 
(1 + A • 16(1.02 + 0.665/^) 


-V4r (Z) 
m 


Taking e and S to be sufficiently small we arrive at a feasible range (cf. Definition (39)) 

0.994 - Ci - Ca ~ vAAQy 1 

2 (1.02 + 0.665/ah) ' ^' 


(82) 


(83) 


This establishes Proposition and in turn Theorem [T] when y t is taken to be a fixed constant. 

To justify the contraction under backtracking line search, it suffices to prove that the resulting step size 
falls within this range ([83]) , which we defer to Appendix [Dj 
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6 Stability 


This section goes in the direction of establishing stability guarantees of TWF. We concentrate on the iterative 
gradient stage, and defer the analysis for the initialization stage to Appendix |C| 

Before continuing, we collect two bounds that we shall use several times. The first is the observation that 


— II V ~ A{zz T )\\ 1 < —1| A{xx t - zz T )\\ 1 + — ||»?||i 
mm m 


< \\h\\\\z\\ 


i 


m 


mi 


< 


(84) 


where the last inequality follows from Cauchy-Schwarz. Setting 


- aJz 


as usual, this inequality together with the trimming rules £[ and £\ gives 

kl<IN- 111711 


^m\\z 


| = — A t i> | < 


1 mN H — 

y/rn 


(i) 


< \\h\ 


(85) 


v/mlp 


where (i) arises from |40, Corollary 5.35]. 

As discussed in Sectional the estimation error is contractive if — A-VUr ( z ) satisfies the regularity con¬ 
dition. With (851 in place, RC reduces to 


--(ve tI (z),h) > 


M 


( 86 ) 


Unfortunately, (861 does not hold for all 2 within the neighborhood of x due to the existence of noise. 


Instead we establish the following: 


• The condition (861 holds for all h obeying 

'\v\\ 


C3- 


< \\h\\ < C 4 \\x\\ 


(87) 


for some constants 03,04 > 0 (we shall call it Regime 1)\ this will be proved later. In this regime, the 
reasoning in Section [3] gives 


distf^ + — WtAz), < (1 — p)dist( 2 , x) 
V to / 


( 88 ) 


for some appropriate constants /x, p > 0 and, hence, error contraction occurs as in the noiseless setting. 
• However, once the iterate enters Regime 2 where 

C3 ||l7|| 


\h \I < 


(89) 


Vm\\zr 

the estimation error might no longer be contractive. Fortunately, in this regime each move by — VtA ( z ) 


is of size at most 0(-^= p][), compare (85). As a result, at each iteration the estimation error cannot 
increase by more than a numerical constant times before possibly jumping out (of this regime). 


Therefore, 


dist(z + —V^tr (z), x) < c 5 - 

to vmlMI 


m 


(90) 


for some constant C 5 > 0. Moreover, as long as ||f 7 ||oo /||®|| 2 is sufficiently small, one can guarantee 
that C 5 y=p^- < C 5 < C 4 1111. In other words, if the iterate jumps out of Regime 2, it will still fall 
within Regime 1. 
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To summarize, suppose the initial guess z^ obeys dist(z(°\ a;) < C 4 11a?11 - Then the estimation error will 
shrink at a geometric rate 1 — p before it enters Regime 2. Afterwards, z® will either stay within Regime 


2 or jump back and forth between Regimes 1 and 2. Because of the bounds (90) and ( 881 , the estimation 
errors will never exceed the order of from then on. Putting these together establishes ( jl6| , namely, 

the first part of the theorem. 


Below we justify the condition (861 for Regime 1, for which we start by gathering additional properties 
of the trimming rules. By Cauchy-Schwarz, ^ || 77 H x < ||rj|| < ^ ||/i|| ||z||. When C 3 is sufficiently large, 

applying Lemmas [l] and [2] gives 


J_ V" 1 

m 2-^1— 1 

1 s~^m 
m 2-^1= 1 


yi - 

a 

n 

to 

< 

x 

m 

\A (xx T - zz T ) 

yi - 

1 T I 2 

K z \ 

> 

m 

\A (xx t - zz T )\\ 


(91) 


From now on, we shall denote 8\ := | 11 0 .A cc | 2 — | a.^ ^ | 2 1 < || y — A(^ 2 : T )|| 1 | to differentiate from 

8 2 . For any small constant e > 0, we introduce the index set Q := {i : \r]i\ < C t ||r 7 || /y/rn} that satisfies 
\Q\ = (1 — e)m. Note that C e must be bounded as n scales, since 


\v\\ 2 - E rfc ^ 2 - ( TO_ |£|) • Ce\\v\\ 2 /m > eC; 


' 2 "r?|| 2 


C t < 1/ \fk. 


(92) 


We are now ready to analyze the regularized gradient, which we separate into several components as 
follows 


V tr £(z) = 2J2 
ieQ 


I T I 2 I T I 2 I T I 2 I T I 2 

-\a>z\ v-^ | a l x | -\ a i z \ 

-r- a i 1 -sin£i + z / ,-=p- 

a i z ~Ar a Z 
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ajz ail£ ' n£ * 


iiQ 


. =V alean^( z ) 
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1 


Vi - K z \ 


a; x — a z 


r&2 


a i- 


(93) 


:=VS 






For each index i G G, the inclusion property (511 (i.e. 8\ C 8\ C £|) holds. To see this, observe that 

\ a J x \ 2 - \ajz\ 2 \ ± \Vi\ 


I Vi ~ \°Jz\' 2 \ G 


Since \r)i\ < C e 11 77 11 /Am -C ||fo||||z|| when C 3 is sufficiently large, one can derive the inclusion (511 
immediately from (911. As a result, all the proof arguments for Proposition [ 2 ] carry over to Vj?) ean £ (z), 
suggesting that 


- < K - V tr 


clean / 


m 


“!?(*)} > 2 {1.99 — 2(Ci +C2) - V / 8/(9)r)^ 1 - e }||h|| 2 . (94) 

Next, letting Wi = £ i n s^{ieG }> we see that for any constant 5 > 0, the noise component obeys 


-V t n r olse ^) 

m 


—A t w 

m 


< 


m 


m 




(95) 


provided that m/n is sufficiently large. Here, (ii) arises from |40j Corollary 5.35], and the last inequality 
is a consequence of the upper estimate 




( 96 ) 
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In turn, this immediately gives 


/ h ,-v^r e e(z)) 

-c 

VI 

-V" oise £ (z) 

\ m / 
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< 2(1 ^> 


vlb 


\/ to ||; 


(97) 


We now turn to the last term V®* tra C (z). According to the definition of and E\ as well as the property 

| 2 — \a T z\ 2 \ 

a r z 1 — ~^-s i rs i ) bounded in magnitude by 


(911, the weight q* := 2 
oJjn||. This gives 


V a i Z 




< \Jm - \Q\ • 6||/i|| < 6y/em\\h\\ 


and hence 


(I V - tra g(z),fr\ < M ' ||-V- tra f(z) || = -||h|| ■ || A T q|| <6(l + S)Ve\\h\ 

\ m / " m 11 m 11 11 


Taking the above bounds together yields 


-(V£ tr (z) , h) > 2 a.99 - 2 (Cl + C 2 )- 


97t a h 


- 6(1 + 1 


\h\ 


2(1 +<5) M 


\h\\. 


a? xMlkll" 

Since \\h\\ > C 3 ^=p^ for some large constant C 3 > 0, setting e to be small one obtains 

-^(V4r (z),h) > 2{l.95-2(Ci+C2)- Vs/W)^ 1 } \\hf 

for all h obeying 


cal 


TO 


< \\h\\ < 


1 al b a lb vW3(al b ) S 


ll’3aft,’ 6 ’ 2a“ b + al b 


(98) 


(99) 


which finishes the proof of Theorem [2] for general r/. 

Up until now, we have established the theorem for general r 7 , and it remains to specialize it to the Poisson 
model. Standard concentration results, which we omit, give 


1 

TO 


/ / L _j lit 

i =1 i=l 


( 100 ) 


with high probability. Substitution into (16) completes the proof. 


7 Minimax lower bound 

The goal of this section is to establish the minimax lower bound given in Theorem[3] For notational simplicity, 

we denote by P(y | w) the likelihood of yi Poissonda^t^l 2 ), 1 < i < m conditional on {a;}. For any 
two probability measures P and Q , we denote by KL (P||Q) the Kullback-Leibler (KL) divergence between 
them: 

KL(P||Q):=|log(^dP, (101) 

The basic idea is to adopt the general reduction scheme discussed in |uj, Section 2.2], which amounts to 
finding a finite collection of hypotheses that are minimally separated. Below we gather one result useful for 
constructing and analyzing such hypotheses. 
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Lemma 8 . Suppose that a,i ~ Af (0,I n ), n is sufficiently large, and m = Kn for some sufficiently large 
constant n > 0. Consider any x £ K”\{0}. On an event B of probability approaching one, there exists a 
collection A4 of M = exp(n/30) distinct vectors obeying the following properties: 

(i) x £ M; 

(ii) for all w^ l \w^ £ A4, 


(Hi) for all w £ A4, 


I / a /8 — ( 2 n )~ 1/2 < 
\aj (w-x) | 2 


,(0 


— II < 3/2 + n 1 / 2 ; 


< 


w — x\\ 


\a : x\ 


-{2 + 17log 3 m}, 1 < i < to. 


( 102 ) 


(103) 


In words, Lemma [ 8 ] constructs a set M. of exponentially many vectors/hypotheses scattered around x 
and yet well separated. From (ii) we see that each pair of hypotheses in M. is separated by a distance 
roughly on the order of 1, and all hypotheses reside within a spherical ball centered at x of radius 3/2 + o(l). 
When ||cc|| > log 1 ' 5 ™, every hypothesis w £ M. satisfies ||io|| « ||a:|| 1. In addition, (iii) says that the 

quantities \aj (w — x) |/|a/a:| are all very well controlled (modulo some logarithmic factor). In particular, 
when ||x|| > log 1 ' 5 ™., one must have 


^ (^-^)| 2 < l ^-^ ll 2 log 3 m < 


* 


log 5 m 


log 3 to < 1 . 


(104) 


In the Poisson model, such a quantity turns out to be crucial in controlling the information divergence 
between two hypotheses, as demonstrated in the following lemma. 


Lemma 9. Fix a family of design vectors {a;}. Then for any w and r £ M™, 

KL( P(y\w + r) || P(y | n?)) < l a ^ r | 2 ( 8 + j | 2 ) • 

Lemma [9] and (104) taken collectively suggest that on the event BC\C (B is in Lemma [ 8 ] and C := {||A|| < 
\J2 to}), the conditional KL divergence (we condition on the a/s) obeys 


(105) 


/ \ ^ a , FTL - —1— - 9 0 

KL(F(y\w) || P {y \ x)) < C 3 > \aj (w — x)\ < 2c^m \\w — x\\ , \/w € M] 


(106) 


here, the inequality holds for some constant C 3 > 0 provided that ||®|| > log 1 "’™, and the last inequality is 
a result of C (which occurs with high probability). We now use hypotheses as in Lemma [8] but rescaled in 
such a way that 

||tu — a;|| x 6 , and ||tt> — w\\ x 5 , Vtn, w £ A4 with w / w. (107) 

for some 0 < S < 1. This is achieved via the substitution w < — x + 6(w — x)\ with a slight abuse of notation, 
A4 denotes the new set. 

The hardness of a minimax estimation problem is known to be dictated by information divergence in¬ 
equalities such as (106 1 . Indeed, suppose that 

1 ^ ... , , . x 1 


E, 


KL( P(y\w) || P (y \ x )) < — log (M - 1) 


M — 1 v ’10 

holds, then the Fano-type minimax lower bound |4lj Theorem 2.7| asserts that 

inf sup E [||a; — x\\ | {a^}] > 


x x£A4 


mm || w — w || 

w , ib G A4, w ^ ib 


Since M = exp(n/30), (108) would follow from 

2cs\\w — *|| 2 < n/(300m). w £ M. 


(108) 


(109) 


( 110 ) 
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Hence, we just need to select 8 to be a small multiple of yjn/m. This in turn gives 


inf sup E [||a: — *|| I {cq}l > min \\w — w || > yfn/m. (HI) 

x X (zj{4 w,w^w 

Finally, it remains to connect ||* — *|| with dist (*,*). Since all the w £ M are clustered around x 
and are at a mutual distance about <5 that is much smaller than ||cc||, we can see that for any reasonable 
estimator, dist(x, *) = ||* — *||. This finishes the proof. 


8 Discussion 


To keep our treatment concise, this paper does not strive to explore all possible generalizations of the theory. 

There are nevertheless a few extensions worth pointing out. 

• More general objective functions. For concreteness, we restrict our analysis to the Poisson log- 
likelihood function, but the analysis framework we laid out easily carries over to a broad class of 
(nonconvex) objective functions. For instance, all results continue to hold if we replace the Poisson 
log-likelihood by the Gaussian log-likelihood; that is, the polynomial function — (th ~ | a J z \ 2 ) 2 
studied in [I]. A general guideline is to first check whether the expected regularity condition 

E [-(£ V M*).h>] > IN 2 

holds for any fixed z within a neighborhood around x. If so, then often times RC holds uniformly within 
this neighborhood due to sharp concentration of measure ensured by the regularization procedure. 

• Sub-Gaussian measurements. The theory extends to the situation where the a, ’s are i.i.d. sub- 
Gaussian random vectors, although the truncation threshold might need to be tweaked based on the 
sub-Gaussian norm of a,. A more challenging scenario, however, is the case where the a^s are generated 
according to the CDP model, since there is much less randomness to exploit in the mathematical 
analysis. We leave this to future research. 


Having demonstrated the power of TWF in recovering a rank-one matrix xx* from quadratic equations, 
we remark on the potential of TWF towards recovering low-rank matrices from rank-one measurements. 
Imagine that we wish to estimate a rank-r matrix X >; 0 and that all we know about X is 

Hi = aJXa,i, 1 < i < m. 

It is known that this problem can be efficiently solved by using more computational-intensive semidefinite 
programs [9 14 . With the hope of developing a linear-time algorithm, one might consider a modified TWF 
scheme, which would maintain a rank-r matrix variable and operate as follows: perform truncated spectral 
initialization, and then successively update the current guess via a regularized gradient descent rule applied 
to a presumed log-likelihood function. 

Moving away from i.i.d. sub-Gaussian measurements, there is a proliferation of problems that involve 
completion of a low-rank matrix X from partial entries, where the rank is known a priori. It is self- 
evident that such entry-wise observations can also be cast as rank-one measurements of X. Therefore, the 
preceding modified TWF may add to recent literature in applying non-convex schemes for low-rank matrix 

or even a broader family of latent-variable models (e.g. dictionary 
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45 , robust PC A 46 


completion 

learning |47 48 , sparse coding |49]7" and mixture problems BEL A concrete application of this flavor is 


a simple form of the fundamental alignment/matching problem |52j 54 . Imagine a collection of n instances, 
each representing an image of the same physical object but with different shift r.j £ {0, • • • ,M— 1}. The 
goal is to align all these instances from observations on the relative shift between pairs of them. Denoting 
by Xi the cyclic shift by an amount ri of Im, one sees that the collection matrix X := [Xj X is a 

rank-Af matrix, and the relative shift observations can be treated as rank-one measurements of X. Running 
TWF over this problem instance might result in a statistically and computationally efficient solution. This 
would be of great practical interest. 
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A Proofs for Section [5] 

A.l Proof of Lemma |3] 

First, we make the observation that (ajz ) 2 — (ajx) 2 = (2 ajz — ajh) ajh is a quadratic function in ajh. 
If we assume 7 < , then on the event £\ one has 


(«i z ) > Q, Nil ' N z \ > 7 Nil ^ 


( 112 ) 


Solving the quadratic inequality that specifies T> 1 gives 


aj h G 

aJz - \J NN) 2 + 7 Nil \ a 7 z 

, a jz~\ 

/( a Tz) 2 - 7 |NII 

a J z \ 

or aj h G 

a l z + \Ji a J z ) 2 -7 Nil \ a J z 

> a J z + \ 

!(aJ z ) 2 + 7 Nil 

aj Z \ 


which we will simplify in the sequel. 

Suppose for the moment that aj z > 0, then the preceding two intervals are respectively equivalent to 


a. h G 


ajh - 2aJz G 


- 7 INI \ a J z \ 


7 INI \ a l z \ 


_ a J z + \J ( a J z ) 2 + 7 INI | a N| a N + VW z ) 2_ 7N 




- 7 Nil N7*| 


7 NII N7*l 


L a N 


,) 2 - 7 | 


s ) 2 + 7 NII 


—V- 

—h. 


Assuming (112) and making use of the observations 


a 7 z +v( a 7 z ) -7lNII|aH 


and 




< 


> 


7 INI |“7*l 


= 7 INI 


7 Nil 

a 

7*1 


(1 + N2) 

K 

zj 


Nil 


a i z + \/ (a- z ) +7INII I“i r2: | 

we obtain the inner and outer bounds 

±(1 + n/2) _ 1 7 |N|] C I U I 2 c [ ± 7 Nil ] ■ 

Setting 71 := gives 

(D^n^r) u(v%n£i,i) c c 

Proceeding with the same argument, we can derive exactly the same inner and outer bounds in the regime 
where aj z < 0 , concluding the proof. 


A.2 Proof of Lemma 0] 

By homogeneity, it suffices to establish the claim for the case where both h and z are unit vectors. 


29 













































Suppose for the moment that h and 2 : are statistically independent from {a^}. We introduce two auxiliary 
Lipschitz functions approximating indicator functions: 


'1, if \ t \ G [VhOla?, V(h99a" b ] ; 

-100 (a" b ) -2 t 2 + 100, if |t| G [VoMaf 3 , af] ; 

T ' 100 (a lb )~ 2 r 2 - 100, if |r| G [al b , V / F01a lb ] ; 

. 0, else. 


fl, if |r| G [0,C0.99 7 ] ; 

Xh(r)~l^T 2 + 100, if |t|g[V09&7,7]; 

[0, else. 

Since h and 2 : are assumed to be unit vectors, these two functions obey 

0 < Xz (aj z ) < l £ i , and 0 < \h (aj h) < 

and thus, 


m m 

~Y,( a ^ h y 1 £{nv^ > -J2{aJ h ) 2 Xz {aJz) X h(aJ h ). 


(113) 


(114) 


(115) 


(116) 


i= 1 


i—1 


We proceed to lower bound A ( a Jh) 2 Xz [ojz) \h (ajh). 

Firstly, to compute the mean of ( ajh) 2 Xz(o-J z )Xh{^J d), we introduce an auxiliary orthonormal matrix 


U z = 


whose first row is along the direction of z, and set 


z T /\\z\ 


h := U z h , and d, := U z ai. 


(117) 


(118) 


Also, denote by a^i (resp. hi) the first entry of di (resp. h ), and dj \i (resp. h\ 1 ) the remaining entries of 
a,i (resp. h ), and let £ ~ A/”(0,1). We have 


E 


ijh) X z {aj z) Xh (aj h) 


> E[{aJh) 2 Xz (ajz)] -E (ajh) (l - Xh (ajh)) 


> E 


,ihi) 2 Xz(aJz) + E (dJ M h v ) 2 E [ Xz (ajz)] - \\h\\ 2 E ^ 2 1{ K | >V ^99 7 } 


> i^ra-Cij + iivii (i-co-Giifc 

> (1-Ci-C 2 )ll^l| 2 , 


(119) 


by y 2 ||d|| 2 , it is a sub-Gaussian random variable with sub-Gaussian norm 0( 7 2 ||d||^). Apply the Hoeffding- 
type inequality |40j Proposition 5.10] to deduce that for any e > 0, 

1 m 

— J2( a J h )\z{aJz)xh(aJh)>E(aJh) 2 Xz{aJz)xh(aJh) -e||d|| 2 (120) 

> (1 - Cl - C 2 - e) ll^ll 2 (121) 


where the identity (1191 arises from (66) and (67). Since (ajh) Xz { a J z ) Xh (o-Jh) is bounded in magnitude 




with probability at least 1 — exp(—fl(e 2 m)). 

The next step is to obtain uniform control over all unit vectors , for which we adopt a basi c ve rsion of an 


e-net argument. Specifically, we construct an e-net J\f e with cardinality |A/" e | < (1 


2/e) 2 " (cf. 


40 


) such that 
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for any (h, z) with \\h\\ = ||z|| = 1, there exists a pair ho, z o G A4 satisfying \\h — ho|| < e and ||z — zo\\ < e. 
Now that we have discretized the unit spheres using a finite set, taking the union bound gives 


1 


m 


E ( a Jho) 2 Xz (ajz 0 ) Xh (ajho) > (1 - Cl - C 2 - e) ||^o|| 2 , Vh, 0 ,z 0 G J\f e 


( 122 ) 


with probability at least 1 — (1 + 2/e) 2n exp(—fl(e 2 m)). 

Define /i(-) and / 2 (-) such that /i(r) := T\ 7 l (y / r) and / 2 (r) := Xz(Vt)i which are both bounded functions 
with Lipschitz constant 0(1). This guarantees that for each unit vector pair h and z, 


{aJ h) Xz (aJz) Xh ( aJh) - (aJ h Q ) Xz ( aJ z 0 ) Xh ( aJh 0 ) 

< I Xh (aJ z) | ■ | (aJ h) 2 Xh (aJ h) - (aJ h 0 ) 2 Xh (aJ h 0 ) \ 

+ I (aJhofxh ( aJh 0 ) \ ■ \Xh ( aJz ) - Xh (aJz 0 ) \ 

< I Xh ( a Jz ) | • \fi(\ajh\ 2 ) - / 1 (|a7^o| 2 )| 

+ \( a Jho) 2 Xh [a]ho) | • |/ 2 (\ajz\ 2 ) - f 2 (|a7^o| 2 ) | 

< | (aJh) 2 - (aJh 0 ) 2 1 + (aJ z) 2 - {aJ z 0 ) 2 1. 

Consequently, there exists some universal constant C 3 > 0 such that 

^ m ^ m 

— ^2( a J h ) Xz (aJz) X h (aJh) -(aJh 0 ) Xz (aJz 0 ) Xh (aJh 0 ) 

1 i =1 1 i=1 


< 


(i) 


1 


1 


— \\A(hh T - h 0 ho ) H- \\A(zz T - z 0 zj)\\ 

m M'- 7 1 ffl v /hi 


V) r,, t 
< C 3 1 — foofy 


<>" 11F ' 


| ZZ T - ZqZq I 


(ii) 


< 2.5C3 {\\h — ho\\ • ||/i|| + ||z — z 0 | 


|} < 5c 3 e, 


where (i) results from Lemma [lj and (ii) arises from Lemma [ 2 ] whenever e < 1/2. 

With the assertion p2) in place, we see that with high probability, 

m 

— J2( a J h ) 2 Xz(aJz)xh{aJh) > (1 - Ci - C 2 - (5c 3 + 1) e) ||h|| : 

7TI 

i=l 


for all unit vectors h and z. Since e can be arbitrary, putting this and (1161 together completes the proof. 


A.3 Proof of Lemma [5] 

The proof makes use of standard concentration of measure and covering arguments, and it suffices to restrict 
our attention to unit vectors h. We find it convenient to work with an auxiliary function 

(M 3 , ifM< 7 2 , 

X2 (r) = < —7 (|r| -y 2 ) +7 3 , if 7 2 < |r| < 2q 2 , 

[0, else. 

Apparently, xi ( r ) is a Lipschitz function of r with Lipschitz norm O (7). Recalling the definition of Xfy 1 , 
we see that each summand is bounded above by 

\ajh \ 3 l v i,i < X 2(\ajh\ 2 ). 

For each fixed h and e > 0, applying the Bernstein inequality |40, Proposition 5.16] gives 


1 m 1 m 

sEWN’v s s 5 >(K h l 2 ) 


< E 


i= 1 


i =1 


X2 


(io, T ^r) 


< E[|a7^| 3 ]+ e = \/8/7T + 1 
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with probability exceeding 1 — exp (—fl (e 2 ?n)). 

Lemma 5.2], there exists an e-net A f e of the unit sphere with cardinality \J\f e \ < (l + 2 )". For 


From 


40 


each h, suppose that ||ho — h|| < e for some ho £ J\f e . The Lipschitz property of \2 implies 


1 

m 


m m 

J2{X2 (\ajh\ 2 ^j - X 2 (|a^o| 2 )} < ~f^\\ajh 


i=l 


(i) 


i= 1 

l^-M \M 


- \a ho 


e, 


where (i) arises by combining Lemmas [l] and [2] This demonstrates that with high probability, 

m 1 m 

— mi a ^| 3 - —' 52 x 2 (\ajh\ 2 ) < y/8Jn + 0{e) 

m X. - J 1 I ^7 tit Z- J ' 


i—1 


2—1 


for all unit vectors h , as claimed. 


A.4 Proof of Lemma |6] 

Without loss of generality, the proof focuses on the case where ||/i|| = 1. Fix an arbitrary small constant 
<5 > 0. One can eliminate the difficulty of handling the discontinuous indicator functions by working with 
the following auxiliary function 


i> if > V’lb (7); 

X3 (r, 7) := { ^777) - 99, if y/r £ [VOMipu, (7), Vlb (7)] 

else. 


0, 


(123) 


Here, 0ib (•) is a piecewise constant function defined as 

I log 7 

■01b (7) := (1 + < 5 ) L lo s(S+a)J } 

which clearly satisfy < if )ib (7) < 7. Such a function is useful for our purpose since for any 0 < 8 < 0.005, 

|a, T b.| > 7 } — X.3 ( | O,; ? 7) — -1| |a] r fr.|>\/oi99i/ , lb(7)} — { | aj h| > 0 . 997 } ' (124) 

For any fixed unit vector h , the above argument leads to an upper tail estimate: for any 0 < t < 1, 


'{x3(\ajh\ 2 , 7) > tj 


< 


| aj h. | >0.997} — — ^ { 1 {|a i T h|>0.997} — 

r°° 2 

2/ c/>{x)dx <——(j) (0.997), 

J 0.997 0.997 


(125) 


where cf>(x ) is the density of a standard normal, and (1251 follows from the tail bound JJ° cf>(x)dx < ( x) 

for all x > 0. This implies that when 7 > 2, both ^(Ja,) h\ 2 ,x) and l|| a T h | >0 9 g 7 | are sub-exponential with 

sub-exponential norm 0{ ”f~ 2 ) (cf. 40 Definition 5.13]). We apply the Bernstein-type inequality for the sum 
of sub-exponential random variables 40 Corollary 5.17], which indicates that for any fixed h and 7 as well 
as any sufficiently small e £ (0,1), 


1 


m 


E X3 (| a 7 ft | >7) < m E 1 {|a^|>0.99 7 } - E 1 {|a i T f l |>0.99 7 } + e ^2 

■i— 1 7 — 1 ' 


s ost exp (- a4972 ) +e 7 


holds with probability exceeding 1 — exp (—fi(e 2 m)). 
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We now proceed to obtain uniform control over all h and 2 < 7 < 2™. To begin with, we consider all 
2 < 7 < m and construct an e-net J\f e over the unit sphere such that: (i) \AF e \ < (l + f) n ; (ii) for any h 
with \\h\\ = 1, there exists a unit vector ho £ J\T e obeying \\h — /iq|| < e. Taking the union bound gives the 
following: with probability at least 1 — ^( 1 +,$) (l + 2 ) n exp(—fi(e 2 m)), 

m 

— ^Zx3(\ajh 0 \ 2 , 7 0 ) < (0.4957o) _1 exp (-0.497^)-feyo -2 




holds simultaneously for all h 0 £ A f e and 70 £ |(1 + <5) 

Note that \3 ( r > 7 o) is a Lipschitz function in r with the Lipschitz constant bounded above by 
With this in mind, for any (h, 7 ) with ||h|| = 1 and 70 := (1 + 6) k < 7 < (1 + <5) fc+1 , one has 


Xs(\ajh 0 \ , 7o)-X3(| ajh\ , 7 ) 


X3(\ajh 0 \ , 7 0 ) -x 3 (|«* T ^| , 7o) 


< 


100 


vfb ( 70 ) 


\a h — \a ho 


It then follows from Lemmas [T]|2] that 


1 

m 


m 

X>s(|a7 


h 


0 , 7o 


m 

J2X3 (\ajh\~, 7 ) 


< 


100 1 

Vfb ( 70 ) m 


A ( 


hh T — hoho 




< 


250(1 + <5 ) 2 


Putting the above results together gives that for all 2 < 7 < (1 + S) lo s( 1 + a > = m, 


1 

m 


m m 

(\ a T h \ 2 ’ 7 ) < (l a ^o| 2 , 7o) + 


250 (1 + 8) 


i —1 


< 


< 


i-1 
1 


exp (-0.497o) + 251 (1 + 6) 2 


0.4957o 

1 


0 . 49 7 exp (‘°' 48572 ) + 251(1 + i)2 7 


with probability exceeding 1 — dgO+a) (■*■ + 7 )" ex P (—ce 2 m). This establishes (711 for all 2 < 7 < m. 
It remains to deal with the case where 7 > m. To this end, we rely on the following observation: 


^ m ^ m 

i=l i—1 


h\ ^ lyM || h ||2 < jL ) Vft. with \\h\\ = 1 , 
m *—' m z m z m 

i—l 


where (i) comes from 16, Lemmas 3.1]. This basically tells us that with high probability, none of the indicator 
variables can be equal to 1. Consequently, A 1 lj| a T h | >m j. = 0, which proves the claim. 


A.5 Proof of Lemma [7] 

Fix 5 > 0. Recalling the notation 17 := 2^ 2 ajh 


‘rcj-g 11 n£j; we see from the expansion (621 that 


-V tr *(z) 

m 


- A 1 v 


<-||A||.||u||<(1 + «5)M 
m Wm 


(126) 


as soon as m > c\n for some sufficiently large ci > 0. Here, the norm estimate ||A|| < \/rn(l + 5) arises 
from standard random matrix results |40, Corollary 5.35]. 
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Everything then comes down to controlling ||u||. To this end, making use of the inclusion (631 yields 
1 \a,Jh\ 2 ' 2 


s »”“ 2 


m ^ V ajz tlMt2 

2=1 x l / 


m 

2 la 


< —V 

m z ' 

2—1 


Jh | + W 


KK\ 


< 


1 


m :=1 
m 


m s 

E{ 4 KK 

2—1 ^ 


^ 4|aTfe | 3 


la 


l«7*l 

( 4 +S 

V a 


l a 7 z l 2 / 


1 £}n(B‘- 1 un*f) 


K\ |a 7 h | 3 / . \ 

r z|/ \ajz\ V £ l n ' D 74 / 

trcitr f not ■wrrtn nrnnciKilihr 1 — 


1 I 

= m £{ 4 ( a K) 2 
2—1 v 

The first term is controlled by | 6 j Lemma 3.1] in such a way that with probability 1 — exp(—12(m)), 

1 m 

-^4(a7/r) 2 <4(l + <5)||/r|| 2 . 

1 i—1 

Turning to the remaining terms, we see from the definition of V *4 and T >* ,a that 

I _T: 


as long as 7 


< 


rv lb ll 


KM < °n^ n K < fl, onfjn ^ 1 

K z l " + on^nDf - \3, on£fnl >} 2 

-. Consequently, one can bound 

iv/EKK KK o , , . . \ 

m ^ \ |a s T z|y Ja^zl V £inv^ 2 J 

5 ™ \ajh\3 

— m ^ |a7z| m ^ |a, T z| f l n:D 7 2 

2 = 1 * 2=1 * 


< 


■fjncy 1 + m 2 ^ 
5(1 + 8) V8/^\\h \\ 3 7 


^ loo (1+<5) l|h| 

where the last inequality follows from (691 and (78 1 . 

Recall that 74 = 3a/!. Taken together all these bounds lead to 

a s) h/itm 1 7 1 i| A | 

’1 <11*11 1001 111 


K » 2 s 


the upper bound 
7 

Too 




whenever -H- < min < 

11*11 _ 1 3a h 


’ 6 


-, +a 41 ^ ~ ’ TT | • Substituting this into (126) completes the proof. 


B Proofs for Section 0 

B.l Proof of Lemma [8] 

Firstly, we collect a few results on the magnitudes of aj x (1 < i < m) that will be useful in constructing 
the hypotheses. Observe that for any given x and any sufficiently large m, 


• I T I 1 

nun a* x\ > —-- 

1 <i<m' 1 m log TO 


> 1 - 


a7 x\ > 


1 


TO log TO 


^/2^^ TO log TO 


> l-o(l). 
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Besides, since E 


^2tt 5 log m — 5 log ? 


-, applying Hoeffding’s inequality yields 


ELw 

EL (i {k 

Sexp| - n (bgL;))' 


= 1 - L {|a7 :c |< J|L} > 41og 
1 

m 


m 
-E 


& L" L {|“L a =|< y > 20log to 


To summarize, with probability 1 — o(l), one has 

mini<j< TO |oL*) > 

E m 

i =l 


< 


mlogm 

to 

4 log TO 


:= k. 


(127) 

(128) 


In the sequel, we will first produce a set A4i of exponentially many vectors surrounding x in such a way 
that e very pair is separated by about the same distance, and then verify that a non-trivial fraction of J\4± 
obeys (1031. Without loss of generality, we assume that x takes the form x = [b, 0, • • • , 0] T for some b > 0. 


The construction of Ai i follows a standard random packing argument. Let w = ,w„] be a 

random vector with 

Wi = Xi~\ — -7=Zi, 1 <i<n, 

V2 n 

where z, n ~' Af (0, 1). The collection A4i is then obtained by generating M\ = exp independent copies 
(1 <l< Mi) of w. For any w l ' l \ w 1 ^' 1 £ Mi, the concentration inequality j40, Corollary 5.35] gives 

P {0.5 yfn — 1 < y/n\\w l ' l ' > — uk || < 1.5 y/n + l} > 1 — 2 exp (—n/8); 

P {0.5 yfn — 1 < v / 2n||'k^ — x|| < 1.5 yfn +1} >1 — 2 exp (—n/8). 


Taking the union bound over all ( A { J ) pairs we obtain 

0.5 — n -1 / 2 < ||u;(0 — ^0')|| <1.5 + n -1 / 2 , \/l ^ j 

l/V8-(2n)- 1 / 2 < ||uk — as|| < -^9/8 + (2n) -1 / 2 , 1 < l < Mi 

with probability exceeding 1 — 2M 2 exp (— |) > 1 — 2 exp (— {{jV 


(129) 


The next step is to show that many vectors in Mi satisfy (1031. For any given w with r := w — x, by 

T 


letting a it ± := [a ij2 , ■ • ■ , a ijTl ] , ry := n, and rp := [r 2 , • • • , r n \ , we derive 

\ajr\ 2 < 2|«Xj > i7’||| 2 + 2\aJj_r±\ 2 < 2|ry| 2 2|qT ± r ± | 2 2||r|| 2 


k,i| 2 |M | 2 


^l| 2 Ki\ 2 \\x\\ 2 - ||x|| 2 


2\aJ ± r ± \ 2 

ki| 2 Nf 


(130) 


It then boils down to developing an upper bound on ■ This ratio is convenient to work with since 

the numerator and denominator are stochastically independent. To simplify presentation, we reorder {a,} 
in a way that 

(to log to) 1 ||x|| < |a7*| L |a;| < • • • < |a„s|; 
this will not affect our subsequent analysis concerning aj ± r± since it is independent of ajx. 

To proceed, we let consist of all but the first entry of — x , and introduce the indicator variables 


3 == 


(hi 

T (0 1 

a 7,± r ±\ 

lh 

It (01 
<i» , x 1 


1 < i < k, 
i > k, 


(131) 
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where k = 41 ™ m as before. In words, we divide ajj_r ^, 1 < i < m into two groups, with the first group 
enforcing far more stringent control t han t he second group. These indicator variables are useful since any 
w"' obeying ]/[/* £/ = 1 will satisfy (1031 when n is sufficiently large. To see this, note that for the first 
group of indices, = 1 requires 


T (i) 

\± r Y 


1 n — 1 2 yjn — 1 n Min 3 ,, /»„ . , 

< —\ - <- — - T" Z , 1 < i < k, 

m\ 2 n myn-2" 11 m" 11 


(132) 


where the second inequality follows from (1291. This taken collectively with (1271 and (1301 yields 

I 2 


7.Tr*0) I 


12 — 


2 || r ( 0||2 


9 ||_(01 


m 2 log 2 m 


X 


— < 
|2 - 


(2 + 9 log 2 to) ||r ( ^ I 


1 < i < k. 


x 


Regarding the second group of indices, = 1 gives 


T ( l ) 


< 




n\\r 


(01 


i = k + 1, • • ■ , m, 


(133) 


where the last inequality again follows from (129). Plugging (133) and (128) into (1301 gives 

I 2 


7 T T*(0 I 


15- < 


2||r-(0|| 2 17 1111 logn (2 + 171og 3 m) ||r 


(01 


+ 


|x|| /log" TO 


< 


) " — 


i > k + 1. 


Consequently, (1031 is satified for all 1 < i < m. It then suffices to guarantee the existence of exponentially 
many vectors obeying Y\Y=i £,■ = 1 ■ 

Note that the first group of indicator variables are quite stringent, namely, for each i only a fraction 
0(l/m) of the equations could satisfy E,\ = 1- Fortunately, Mi is exponentially large, and hence even 
Mi/m k is exponentially large. Put formally, we claim that the first group satisfies 


Mi k 

SIR a 

1=1 i= 1 


Mi 


(27r) fc ^ 2 (1 + 4y / fc/n) fe / 2 


V 1/271 "to / 


= Ml 


(134) 


with probability exceeding 1 — exp (—17 (fc)) — exp(—Mi/4). With this claim in place (which will be proved 
later), one has 


Mi k 




1 1 

—k = 9 eX P 


Z=1 i—1 


(e 2 m) 


1 k (2 + log to) 
20 n 


n ) > ^ exp 


1 

—1 
25 


when n and m/n are sufficiently large. In light of this, we will let A4 2 be a collection comprising all 
obeying JX/ =1 £( = 1) which has size M 2 > 1 exp (^n) based on the preceding argument. For notational 
simplicity, it will be assumed that the vectors in _A/f 2 are exactly (1 < j < M 2 ). 

We now move on to the second group by examining how many vectors in A4 2 further satisfy 

H? =k+1 & = 1. Notably, the above construction of A4 2 relies only on {aq} i<i<k and is independent of 
the remaining vectors {di}i>k- In what follows the argument proceeds conditional on A4 2 and {a.j}i<i<fc. 
Applying the union bound gives 


E 


sCO-ii 

M 2 m 

< E E 


i=k -\-1 


>2 


- 1, M 2 I 

= ) ^ Pj 3 i (k < i < to) : 


t (/) 

l i,± r ± 


> 


2 (n — 1) log n 


T (0 

a z,± r ± 


> 


2 (n — 1) log n 


J = 1 2 = fc+l 

This combined with Markov’s inequality gives 

,«2 


< M 2 TO^T. 

to 




i=/c+l 


en < 


TO log TO 


M 2 
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with probability 1 — o(l). Putting the above inequalities together suggests that with probability 1 — o(l), 
there exist at least 


( to log „ If m log m\ f 1 \ /n\ 

2 W -J-) U") ^ “■> (so) 

vectors in .M 2 satisfying YliLk+i^l = 1- We then choose A4 to be the set consisting of all these vectors, 
which forms a valid collection satisfying the properties of Lemma 151 

Finally, the only remaining step is to establish the claim ( |134[ ). To start with, consider an n x k matrix 
B := [&i, • • • , bk] of i.i.d. standard normal entries, and let u ~ AT ( 0 , -/„) . Conditional on the {bf s, 


bn, — 


^1 ,u 


bju 

bk,ZL 


. b l u . 


AA ( 0, ~B t B ) . 


For sufficiently large to, one has k = 41 ™ m < \n. Using 40 Corollary 5.35] we get 

1 


-B B-I 

n 


< A\Jkjn 


(135) 


with probability 1 — exp (—f l(k)). Thus, for any constant 0 < e < 4, conditional on {6;} and (1351 we obtain 

1 


n i - 


(136) 


> (27r) 2 det 2 (^ — B T B^j 


b„ eT 


1. x / 1 

exp \ - - 

\n 


bl(^B' B) 'b u )AK 


> 


^ _ fe /» -i _ 1”^ 

(2tt) _ 2 fl + 4 \Jk/n\ 2 / exp ( — - (l — 4i Jk/n\ 'S^b^ v \db. 

v ' J b„ex V 2 V J ' J 


> (27t) 2 (l + 4 s/k/n) 2 (v / 27tto) 


-fe 


(137) 

(138) 


where T := {b | |&j| < to -1 , 1 < i < k} and (1371 is a direct consequence from (1351. 

When it comes to our quantity of interest, the above lower bound (138) indicates that on an event (defined 
via {ot;}) of probability approaching 1, we have 


E 


- M i( 2?r ) 2 (l + 4v^) 2 (y/2nm) k . 


Since conditional on {d^}, f| t=1 are independent across l, applying the Chernoff-type bound 
4.5] gives 

. , . nr k 

, , _ - ['Zn) 2 I I -4- 4 1 / fc / n I I \/ '/Trm. ' 

■d=l-I-j=l 2 


55 


(139) 


Theorem 


t) 2 ^1 + A^/k/nj fs/Znrri) 
with probability exceeding 1 - exp ( - § ^ k/2 ^^—^ /2 ). This concludes the proof. 


B.2 Proof of Lemma [9] 

Before proceeding, we introduce the y 2 -divergence between two probability measures P and Q as 

X 2 (P||Q):=y(^) dQ-1. (140) 

It is well known (e.g. |4l| Lemma 2.7]) that 

KL(P||Q)<log(l + X 2 (P||Q)), (141) 
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and hence it suffices to develop an upper bound on the \ 2 divergence. 

Under independence, for any w o,tfi £ R n , the decoupling identity of the y 2 divergence 41. Page 96] 
gives 

X 2 (P (y | ini) || P (y | w 0 )) = J]E (1 + X 2 (P {Vi I ™i) II P {Vi I ™o))) - 1 


(\a] Wl \ 2 -\ajw 0 \ 2 y 

= ex p E i=1 - 


\ a Jw 0 \ 2 


- 1. 


The preceding identity (1421 arises from the following computation: by definition of y 2 (-||-), 


X 2 (Poisson (Ai) || Poisson (A 0 )) = j ^ ^ ^ } 

l^k=o Ag exp (—A 0 J k\ J 


- 1 


( 


= exp A 0 - 2Ai + 


A? 


{E 


(Af/Ao y 
k=o k\ 




— 1 = exp 




- 1. 


Set r := wx — w 0 . To summarize, 

m 

KL(P (y | w-i) || P(y | w 0 )) < 

m 

< E 


i\ a J w il 2 - \qjw 0 \ 2 y 
\ a J w o\ 2 


r | 2 (2 |a7 w o| + KH) 2 

1=1 \ a J w o\ 2 

- 

V “>o 2 


(142) 


(143) 


(144) 


C Initialization via truncated spectral Method 

This section demonstrates that the truncated spectral method works when m x n, as stated in the proposition 
below. 

Proposition 3. Fix 6 > 0 and x £ R". Consider the model where yi = | o.Aa: | 2 + and ai J\f(0,I). 

Suppose that 

\Vi\ < emax{||a;|| 2 , \ajx\ 2 }, 1 <i<m (145) 

for some sufficiently small constant e > 0. With probability exceeding 1 — exp(—fl(m)), the solution z 
returned by the truncated spectral method obeys 

dist(z^, x) < £||®||, (146) 


provided that m > con for some constant cq > 0. 


Proof. By homogeneity, it suffices to consider the case where ||x|| = 1. Recall from |6 Le mma 3.1] that 
m Sidi \ a J x \ 2 £ [1 ± £]||cc|| 2 holds with probability 1 — exp(—f2(m)). Under the hypothesis (1451, one has 


1 

m 


\v\\i < — 




i —1 


+ \ajx\ 2 ) <e||a:|| 2 +e(l + e) 


< 3er 11 11 


which yields 


X 2 

A o 


1 v •\ rn 1 V ' 

m 2-^1 =l 1 m 


i=i 


^ Hi 

\ a J x \ 2 + — EE 

m 

i=i 


£ [l±4er]||®|| 


with probability 1 — exp(—12(m)). 


(147) 
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Consequently, when \rji\ < e||:c|| 2 , one has 


1 {\{a.J x )i+T H \<al(±j: i y l )} ^ 1 { \a.J ^< 0 % (i £, Vl) + \Vi l} ~ 1 {\aJ ^<(l+4e)a 2 y +e}, 
1 {lW*) J -H»|<«S(iE l !/l)} - 1 {\a.Jx\ 2 <a 2 y {±i: i yi)-\ri i \} ^ 1 {\aj X \ 2 <(l-4 e)ft-e}. 


(148) 


Besides, in the case where |?7i| < e\ajx \ 2 , 

1 {|( 0 T !E )2 +7)i |< a 2(_L £,„,)} > !{(1+e) |aT*P <a= ( i £, VI )} ~ 1 { Ift x| 2 < } ' 


1 {\(a,7 X ) 2 +r li \< a 2 y {± T, t yi)} - 1 {(l-e)\ a J X \ 2 < a l(±J2 lV i)} - 1 {|a,Jx| 2 <^c«=}; 


Taken collectively, these inequalities imply that 

^ m ^ m 

~^2 a i a J (ajx) 2 l{ ]ajx \^ 2 }< Y =< -^a i a7(a7x) 2l {| Q T a! |< 71 }. 

4=1 4=1 


(149) 


(150) 


■ =Y 2 


■=Y 1 


where 71 := maxjy^l + 4e)a 2 + e, and 72 := min{^/(1 — 4e)o ; 2 — s, \J\^t a y}- Letting £ 

A/” (0,1), one can compute 


E [Yd = /3\xx t + ft/, and E [Y 2 \ = fcxx T + ft/, 


(151) 


where ft := E [£ 4 l {m < 7l} ] - E[£ 2 l { , e ,< 7l} ], ft := E[£ 2 l {k |< 7l} ], ft := E[£ 4 1 { , 5 |< 72} ] - E[£ 2 l {k |< 72} ] 
and ft := E[£ 2 1 {| 5 I< 72 }]- Recognizing that a.iaj (aPai) lr \ a T x)\<c\ can rewritten as bibj for some 
sub-Gaussian vector b; := oq [ajx^ l.{j a T. B )| <c \, we apply standard results on random matrices with non¬ 
isotropic sub-Gaussian rows (40| Equation (5.26)] to deduce 

||Y 1 -E[Y 1 ]||<(5, \\Y 2 -E[Y 2 }\\<6 (152) 

with probability 1 — exp (—12 (to)), provided that m/n exceeds some large constant. Besides, when e is 


sufficiently small, one further has ||E[Jft] — E [Ift] || < <5. These taken collectively with (150) give 

||Y - Pixx T - ft/1| < 3(5. (153) 

and taking S, e 


Fix some small S > 0. With (1531 in place, applying the Davis-Kahan sin 0 theorem 
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to be sufficiently small, we obtain 

dist(z, a;) < 5, 

where z is the leading eigenvector of Y. Since ft 0 ) := Aoft one derives 

dist(ft°), x) < dist(Aoft z) + dist(z, x) 

< |Aq — l|+<5 < max{-\/l + 2e — 1, 1 — ftl — 2 e} + <5 


(154) 


as long as m/n is sufficiently large, where the last inequality follows from (147). Picking <5 and e to be 
sufficiently small, we establish the claim. 

□ 


We now justify that the Poisson model ([ 4 ]) satisfies the condition (1451. Suppose that /i,; = (ajx) 2 and 
hence yi ~ Poisson (/Zj). It follows from the Chernoff bound that for all f > 0, 


P (Di - Vi>r) < 


E\e tyi 


exp (fa (e* - 1)) 


exp (t(yi + r)) exp (t(/z* + r)) 
Taking r = 2eft and t = e for any 0 < i < 1 gives 


= exp (fa (e* - t - l) - tr) , 


(i) 


’ (y% - Mi > 2eft) < exp (/x* (e* - t - 1 - 2 it)) < exp (^ ( [t 2 - 2 it)) = exp (-/^e 2 ), 


39 



















where (i) follows since e t < 1 +t +1 2 (0 < t < 1). In addition, for any e > 1, taking t = 1 we get 
P (y» ~ Hi > 2efXi) < exp (/Xi (e 4 — t — 1 — 2ef)) < exp (-0.5/Xje). 

Suppose that ||x|| > logm. When Hi > ||a:|| 2 , setting e = 0.5e < 1 yields 

P (j/i - Hi > em) < exp (—/x,;£ 2 /4) < exp (-ff(e 2 ||a;|| 2 )) = exp (-12(e 2 log 2 to)) . 
When Hi ^5 ||a;|| 2 , letting m = ^x/||a;|| 2 and setting e = e/2 m > e, we obtain 

P {Vi ~ IH> e||*|| 2 ) = P (Vi - Hi > 2e/Xi) < exp (- min{e, 0.5} • e/Xj) 

= exp ^ — 0.5min{e, 0.5}e||a;|| 2 ^ = exp ( — fl (e 2 log 2 to) ). 

In view of the union bound, 

P (3i : r]i > emax{||a;|| 2 , a;| 2 }) < to exp ( — (e 2 log 2 to) ) -+ 0. 


(155) 


Similarly, applying the same argument on —y* we get rji > —emax{||a:|| 2 , \a- a;| 2 } for all i, which together 


with (155) establishes the condition (1451 with high probability. In conclusion, the claim (146) applies to 


the Poisson model. 


D Local error contraction with backtracking line search 

In this section, we verify the effectiveness of a backtracking line search strategy by showing local error 
contraction. To keep it concise, we only sketch the proof for the noiseless case, but the proof extends to the 
noisy case without much difficulty. Also we do not strive to obtain an optimized constant. For concreteness, 
we prove the following proposition. 

Proposition 4. The claim in Proposition^ continues to hold if ah > 6, a)J b — 5, al b < 0.1, a p > 5, and 

IWI/lbll < etr (156) 

for some constant e t r > 0 independent of n and to. 

Note that if ah > 6, a“ b > 5 and a} b < 0.1, then the boundary step size Ho given in Proposition [l] 
satisfies 

0.994 - Ci - Ca - 


,-i'i 


> 0.384. 


2(l.02 + 0.665a/ j 

Thus, it suffices to show that the step size obtained by a backtracking line search lies within (0,0.384). For 
notational convenience, we will set 

p := m -1 V£ t r (z) and := { \ajz\ > a/ b ||z|| and |aPp| < a p ||p||} 
throughout the rest of the proof. We also impose the assumption 

M/\\z\\<e (157) 

for some sufficiently small constant e > 0, so that laPpI / laPzl sma ll for all non-truncated terms. It is 


self-evident from (791 that in the regime under study, one has 
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||p|| > 2 {1.99 - 2 (Cr + Ca) - V^^ h )~ l - o(l)} ||h|| > 3. 
To start with, consider three scalars h, 6, and 5. Setting bg := / 2 -, we get 

2 ,_ (b + S) 2 2 . ,2 f, , m 2 ,_ /■, , u \ 1-21 


(158) 


(6 + h,y log 
(i) 


b 2 


- (b + 6) + b 2 = (b + hf log (1 + b s ) - b~bg 


< (b + h) 2 {bg ~ 0A875bj} - b 2 b s = ((6 + h) 2 - b 2 )bg - 0.4875 (6 + h) 2 b 2 s 
= h6(2 + h/b ) (2 + S/b ) - 0.4875 (1 + h/b) 2 |<5 (2 + S/b)\ 2 


= AhS + 


2h 2 S 2hS 2 h 2 S 2 


, 9 — 0.4875<5 2 I 1 + — 
b z V b 




(159) 
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where (i) follows from the inequality log (1 + x) < x — 0.4875x 2 for sufficiently small x. To further simplify 
the bound, observe that 

8 2 [l+'j) (2+t) >48 2 [l+'-\ + <5 2 I 1 + tV -w 


b / b 


and 


2 hS 2 h 2 S 2 

~U~ + ~b 2“ 


= 1+T -lU 2 - 


Plugging these two identities into (1591 yields 


2 h 2 8 


(1591 < 4hS H---0.95 1+- + 1 \ 5 2 - 0.4875(5 2 1 + 


hY 48 


b b 


/ ..f , n.,2 2/i 2 1<5| 1.9|/i| 2 1.951(5 3 | / h\ 2 

Replacing respectively 6, <5, and h with aj z, raj p , and —ajh, one sees that the log-likelihood li (z) = 
y t log(\aJz\ 2 ) - \ajz\ 2 obeys 


li (z + rp) - li (z) = yi log 


\aj (z + rp ) 


- \aj (z + rp) I + \ajz\ 


< -4T ( ajh ) (ajp) - 1.95r 2 (alp) + 




~l2,; 


2 T (ajh) 2 

ajp\ 1.9 t 2 \ajh\ 


ajz 


ajz 



■ =h,i 



= /4, 


“i P 


1.95r d la- p 


1 - 


ajh, 
aj z 


■=h,, 


The next step is then to bound each of these terms separately. Most of the following bounds are straight¬ 
forward consequences from |6j Lemma 3.1] combined with the truncation rule. For the first term, applying 
the AM-GM inequality we get 

1 JJ 4 t JJ f 3.64 2 , T ,2 I. T , 2 ] 

i— 1 i= 1 v y 


< 


4t (1 + 8) f 3.64 2 2 1 


3.64 


2 M\ + 2 IIpII 


Secondly, it follows from Lemma [4] that 

1 m 1 m 

-J2 l2 >M = - L95r2 -EK T P) 2 1 £ * <-1-95 (l - Ci - C 2 ) 

2 =1 2—1 

}] ’ E [ 1 {|^|<v^oT«^}]} and C 2 := E[f 2 l{|£| >v /o*9a h }]- The third term is 


T 2 \\p\\ 2 ■ 


where Cl : = max{E[£ 2 lj^| 
controlled by 


< V / I.01ai b 




Fourthly, it arises from the AM-GM inequality that 


1 V—A 

< 

m L ' 3 


1.9r 2 a p ||p|| 1 JJ , ■ T , , T 


2=1 


a lb llzll m 


j2\ a J h \\ a Jp\j^ 2 —J2\ 2 \ a i h \ + 0 !°^ 


2=1 


2=1 


1 2 1 


< 


£ r 2 ||p|| 2 . 
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Finally, the last term is bounded by 


1 ™ 1™ 1.95 r 3 

m — m 2-^ \ n l 


a iP / a x 


< 


1.95r 3 a 3 ||p|| 3 i 


W) 3 n*ir 




* ,i n 

P 


Under the hypothesis (1581, we can further derive A I li \ £ ^ < t (1.1 + 5) ||p|| . Putting all the above 
bounds together yields that the truncated objective function is majorized by 

-j m m 

— Y] i ( z + T P) — ( z )} 1 f* < — yy (Il,i + I2,i + I3,i + I4,i + h,i ) If' 

TO ' 3 TO ' 3 

2=1 2=1 

r 2 ||pf +re||p|| 2 


< r (1.1 + 5) ||p|| 2 — 1.95 ( 1 -C 1 -C 2 ) 

= { T (l- 1 + 6) - 1.95 (l - Ci - C 2 ) t 2 + re j 


IIPII 


(160) 


for some constant e > 0 that is linear in e. 

Note that the backtracking line search seeks a point satisfying A {^* ( z + r P) — ( z )} If * > 

||p|| 2 . Given the above majorization (1601, this search criterion is satisfied only if 


t/2 < r (1.1 + 5) - 1.95(1 - Cl - C 2 )r 2 + ' 


or, equivalently, 

0.6 + S + e 

T < -=-— := Tub- 

- 1.95(1 - Cl - C 2 ) 

Taking 5 and e to be sufficiently small, we see that r < r u b < 0.384, provided that ctAf < 0.1, a^ b > 5, 
oth > 6, and a p > 5. 

Using very similar arguments, one can also show that ^ Y^iLi ( z + T P) ~ (2)} If* is minorized by a 

similar quadratic function, which combined with the stopping criterion A Y^Li {U (z + rp) — ^ (z)} lf » > 
|r ||p|| 2 suggests that r is bounded away from 0. We omit this part for conciseness. 
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